1 Requirements

requirements=c("summarytools","plotly", "ggplot2", "plyr","dplyr", "lubridate","ggmap", "ggdensity", "patchwork")

for (req in requirements){
  if (!require(req, character.only = TRUE)){
      install.packages(req)
  }
}

2 Data Description

List of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year.

This is a breakdown of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year. This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning before being posted on the NYPD website. Each record represents a shooting incident in NYC and includes information about the event, the location and time of occurrence. In addition, information related to suspect and victim demographics is also included. This data can be used by the public to explore the nature of shooting/criminal activity. Please refer to the attached data footnotes for additional information about this dataset.

Each row is a Shooting Incident.

There are 21 columns:

  1. INCIDENT_KEY: Randomly generated persistent ID for each arrest (text)

  2. OCCUR_DATE: Exact date of the shooting incident (date)

  3. OCCUR_TIME: Exact time of the shooting incident (time)

  4. BORO: Borough where the shooting incident occurred (categorical)

  5. LOC_OF_OCCUR_DESC: where the shooting incident took place: “inside”, “outside” or not specified. (categorical)

  6. PRECINCT: Precinct where the shooting incident occurred. (categorical)

  7. JURISDICTION_CODE: Jurisdiction where the shooting incident occurred. Jurisdiction codes 0(Patrol), 1(Transit) and 2(Housing) represent NYPD whilst codes 3 and more represent non NYPD jurisdictions.

  8. LOC_CLASSFCTN_DESC: description of the classification of the location of the shooting incident: COMMERCIAL, DWELLING, HOUSING, OTHER, PARKING LOT, PLAYGROUND, STREET, TRANSIT, VEHICLE or not specified.

  9. LOCATION_DESC: Location of the shooting incident (categorical)

  10. STATISTICAL_MURDER_FLAG: Shooting resulted in the victim’s death which would be counted as a murder (categorical, binary)

  11. PERP_AGE_GROUP: Perpetrator’s age within a category (categorical)

  12. PERP_SEX: Perpetrator’s sex description (categorical)

  13. PERP_RACE: Perpetrator’s race description (categorical)

  14. VIC_AGE_GROUP: Victim’s age within a category (categorical)

  15. VIC_SEX: Victim’s sex description (categorical)

  16. VIC_RACE: Victim’s race description

  17. X_COORD_CD: Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)

  18. Y_COORD_CD: Midblock Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) (text)

  19. Latitude: Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) (numerical)

  20. Longitude: Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) (numerical)

  21. Lon_Lat: Longitude and Latitude Coordinates for mapping (point)

link to the data: https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8

2.1 Analysis description

This analysis is based on a binary classification task: prediction of the victim survival based on the available information.

Specifically, I will answer the following questions:

  • Is survival more likely in a specific neighborhood?

  • Is it more likely to survive based on gender/age of the victim? (obviously younger victims should be more likely to survive.)

  • Is it more likely to survive based on the victim’s ethnicity?

  • Is it more likely to survive if the shooter belongs to a different/same ethnicity than the victim?

  • Is it more likely to survive based on sex/age/ethnicity of the perpetrator?

  • Is there a trend with respect to the date on which the incident occurred?

  • Is survival less likely in late hours?

  • Is it more likely to survive on a weekday?

  • What happened during the pandemic period?

3 Data exploration and cleaning

shootings = read.csv("data/NYPD_Shooting_Incident_Data__Historic__20231218.csv", stringsAsFactors = TRUE)
dim(shootings)
## [1] 27312    21
print(dfSummary(shootings), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 INCIDENT_KEY [integer]
Mean (sd) : 120860536 (73412859)
min ≤ med ≤ max:
9953245 ≤ 90372218 ≤ 261190187
IQR (CV) : 124949350 (0.6)
21420 distinct values 27312 (100.0%) 0 (0.0%)
2 OCCUR_DATE [factor]
1. 01/01/2006
2. 01/01/2007
3. 01/01/2008
4. 01/01/2009
5. 01/01/2010
6. 01/01/2011
7. 01/01/2012
8. 01/01/2013
9. 01/01/2014
10. 01/01/2015
[ 5751 others ]
8(0.0%)
18(0.1%)
19(0.1%)
7(0.0%)
8(0.0%)
12(0.0%)
11(0.0%)
11(0.0%)
12(0.0%)
3(0.0%)
27203(99.6%)
27312 (100.0%) 0 (0.0%)
3 OCCUR_TIME [factor]
1. 00:00:00
2. 00:01:00
3. 00:02:00
4. 00:03:00
5. 00:04:00
6. 00:05:00
7. 00:06:00
8. 00:07:00
9. 00:08:00
10. 00:09:00
[ 1411 others ]
6(0.0%)
67(0.2%)
39(0.1%)
28(0.1%)
40(0.1%)
69(0.3%)
23(0.1%)
24(0.1%)
25(0.1%)
32(0.1%)
26959(98.7%)
27312 (100.0%) 0 (0.0%)
4 BORO [factor]
1. BRONX
2. BROOKLYN
3. MANHATTAN
4. QUEENS
5. STATEN ISLAND
7937(29.1%)
10933(40.0%)
3572(13.1%)
4094(15.0%)
776(2.8%)
27312 (100.0%) 0 (0.0%)
5 LOC_OF_OCCUR_DESC [factor]
1. (Empty string)
2. INSIDE
3. OUTSIDE
25596(93.7%)
242(0.9%)
1474(5.4%)
27312 (100.0%) 0 (0.0%)
6 PRECINCT [integer]
Mean (sd) : 65.6 (27.3)
min ≤ med ≤ max:
1 ≤ 68 ≤ 123
IQR (CV) : 37 (0.4)
77 distinct values 27312 (100.0%) 0 (0.0%)
7 JURISDICTION_CODE [integer]
Mean (sd) : 0.3 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2
IQR (CV) : 0 (2.3)
0:22809(83.5%)
1:74(0.3%)
2:4427(16.2%)
27310 (100.0%) 2 (0.0%)
8 LOC_CLASSFCTN_DESC [factor]
1. (Empty string)
2. COMMERCIAL
3. DWELLING
4. HOUSING
5. OTHER
6. PARKING LOT
7. PLAYGROUND
8. STREET
9. TRANSIT
10. VEHICLE
25596(93.7%)
100(0.4%)
127(0.5%)
280(1.0%)
31(0.1%)
7(0.0%)
30(0.1%)
1103(4.0%)
15(0.1%)
23(0.1%)
27312 (100.0%) 0 (0.0%)
9 LOCATION_DESC [factor]
1. (Empty string)
2. (null)
3. ATM
4. BANK
5. BAR/NIGHT CLUB
6. BEAUTY/NAIL SALON
7. CANDY STORE
8. CHAIN STORE
9. CHECK CASH
10. CLOTHING BOUTIQUE
[ 31 others ]
14977(54.8%)
977(3.6%)
1(0.0%)
3(0.0%)
628(2.3%)
112(0.4%)
7(0.0%)
5(0.0%)
1(0.0%)
14(0.1%)
10587(38.8%)
27312 (100.0%) 0 (0.0%)
10 STATISTICAL_MURDER_FLAG [factor]
1. false
2. true
22046(80.7%)
5266(19.3%)
27312 (100.0%) 0 (0.0%)
11 PERP_AGE_GROUP [factor]
1. (Empty string)
2. (null)
3. <18
4. 1020
5. 18-24
6. 224
7. 25-44
8. 45-64
9. 65+
10. 940
11. UNKNOWN
9344(34.2%)
640(2.3%)
1591(5.8%)
1(0.0%)
6222(22.8%)
1(0.0%)
5687(20.8%)
617(2.3%)
60(0.2%)
1(0.0%)
3148(11.5%)
27312 (100.0%) 0 (0.0%)
12 PERP_SEX [factor]
1. (Empty string)
2. (null)
3. F
4. M
5. U
9310(34.1%)
640(2.3%)
424(1.6%)
15439(56.5%)
1499(5.5%)
27312 (100.0%) 0 (0.0%)
13 PERP_RACE [factor]
1. (Empty string)
2. (null)
3. AMERICAN INDIAN/ALASKAN N
4. ASIAN / PACIFIC ISLANDER
5. BLACK
6. BLACK HISPANIC
7. UNKNOWN
8. WHITE
9. WHITE HISPANIC
9310(34.1%)
640(2.3%)
2(0.0%)
154(0.6%)
11432(41.9%)
1314(4.8%)
1836(6.7%)
283(1.0%)
2341(8.6%)
27312 (100.0%) 0 (0.0%)
14 VIC_AGE_GROUP [factor]
1. <18
2. 1022
3. 18-24
4. 25-44
5. 45-64
6. 65+
7. UNKNOWN
2839(10.4%)
1(0.0%)
10086(36.9%)
12281(45.0%)
1863(6.8%)
181(0.7%)
61(0.2%)
27312 (100.0%) 0 (0.0%)
15 VIC_SEX [factor]
1. F
2. M
3. U
2615(9.6%)
24686(90.4%)
11(0.0%)
27312 (100.0%) 0 (0.0%)
16 VIC_RACE [factor]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
10(0.0%)
404(1.5%)
19439(71.2%)
2646(9.7%)
66(0.2%)
698(2.6%)
4049(14.8%)
27312 (100.0%) 0 (0.0%)
17 X_COORD_CD [numeric]
Mean (sd) : 1009449 (18377.8)
min ≤ med ≤ max:
914928.1 ≤ 1007731 ≤ 1066815
IQR (CV) : 16809.5 (0)
12088 distinct values 27312 (100.0%) 0 (0.0%)
18 Y_COORD_CD [numeric]
Mean (sd) : 208127.4 (31886.4)
min ≤ med ≤ max:
125756.7 ≤ 194486.6 ≤ 271127.7
IQR (CV) : 56684.1 (0.2)
12283 distinct values 27312 (100.0%) 0 (0.0%)
19 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
12609 distinct values 27302 (100.0%) 10 (0.0%)
20 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
12594 distinct values 27302 (100.0%) 10 (0.0%)
21 Lon_Lat [factor]
1. (Empty string)
2. POINT (-73.70204616699993
3. POINT (-73.70308024299999
4. POINT (-73.70659811199994
5. POINT (-73.70686574799998
6. POINT (-73.71563825499999
7. POINT (-73.71803289299999
8. POINT (-73.72199603099995
9. POINT (-73.72653338999999
10. POINT (-73.72739037499997
[ 12636 others ]
10(0.0%)
1(0.0%)
4(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
27289(99.9%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

For variables LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION_DESC, PERP_AGE_GROUP, PERP_SEX, PERP_RACE are present both ’’ and ‘(null)’ values: we consider these as missing values. Furthermore, I should transform INCIDENT_KEY, PRECINCT and JURISDICTION_CODE in factors.

shootings = read.csv("data/NYPD_Shooting_Incident_Data__Historic__20231218.csv", na.strings = c('','(null)'), stringsAsFactors = TRUE)
shootings$INCIDENT_KEY = as.factor(shootings$INCIDENT_KEY)
shootings$PRECINCT = as.factor(shootings$PRECINCT)
shootings$JURISDICTION_CODE = as.factor(shootings$JURISDICTION_CODE)
print(dfSummary(shootings), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 INCIDENT_KEY [factor]
1. 9953245
2. 9953246
3. 9953247
4. 9953248
5. 9953249
6. 9953250
7. 9953252
8. 9953255
9. 9953257
10. 9953258
[ 21410 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
2(0.0%)
1(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
27300(100.0%)
27312 (100.0%) 0 (0.0%)
2 OCCUR_DATE [factor]
1. 01/01/2006
2. 01/01/2007
3. 01/01/2008
4. 01/01/2009
5. 01/01/2010
6. 01/01/2011
7. 01/01/2012
8. 01/01/2013
9. 01/01/2014
10. 01/01/2015
[ 5751 others ]
8(0.0%)
18(0.1%)
19(0.1%)
7(0.0%)
8(0.0%)
12(0.0%)
11(0.0%)
11(0.0%)
12(0.0%)
3(0.0%)
27203(99.6%)
27312 (100.0%) 0 (0.0%)
3 OCCUR_TIME [factor]
1. 00:00:00
2. 00:01:00
3. 00:02:00
4. 00:03:00
5. 00:04:00
6. 00:05:00
7. 00:06:00
8. 00:07:00
9. 00:08:00
10. 00:09:00
[ 1411 others ]
6(0.0%)
67(0.2%)
39(0.1%)
28(0.1%)
40(0.1%)
69(0.3%)
23(0.1%)
24(0.1%)
25(0.1%)
32(0.1%)
26959(98.7%)
27312 (100.0%) 0 (0.0%)
4 BORO [factor]
1. BRONX
2. BROOKLYN
3. MANHATTAN
4. QUEENS
5. STATEN ISLAND
7937(29.1%)
10933(40.0%)
3572(13.1%)
4094(15.0%)
776(2.8%)
27312 (100.0%) 0 (0.0%)
5 LOC_OF_OCCUR_DESC [factor]
1. INSIDE
2. OUTSIDE
242(14.1%)
1474(85.9%)
1716 (6.3%) 25596 (93.7%)
6 PRECINCT [factor]
1. 1
2. 5
3. 6
4. 7
5. 9
6. 10
7. 13
8. 14
9. 17
10. 18
[ 67 others ]
25(0.1%)
58(0.2%)
28(0.1%)
109(0.4%)
109(0.4%)
73(0.3%)
60(0.2%)
56(0.2%)
10(0.0%)
34(0.1%)
26750(97.9%)
27312 (100.0%) 0 (0.0%)
7 JURISDICTION_CODE [factor]
1. 0
2. 1
3. 2
22809(83.5%)
74(0.3%)
4427(16.2%)
27310 (100.0%) 2 (0.0%)
8 LOC_CLASSFCTN_DESC [factor]
1. COMMERCIAL
2. DWELLING
3. HOUSING
4. OTHER
5. PARKING LOT
6. PLAYGROUND
7. STREET
8. TRANSIT
9. VEHICLE
100(5.8%)
127(7.4%)
280(16.3%)
31(1.8%)
7(0.4%)
30(1.7%)
1103(64.3%)
15(0.9%)
23(1.3%)
1716 (6.3%) 25596 (93.7%)
9 LOCATION_DESC [factor]
1. ATM
2. BANK
3. BAR/NIGHT CLUB
4. BEAUTY/NAIL SALON
5. CANDY STORE
6. CHAIN STORE
7. CHECK CASH
8. CLOTHING BOUTIQUE
9. COMMERCIAL BLDG
10. DEPT STORE
[ 29 others ]
1(0.0%)
3(0.0%)
628(5.5%)
112(1.0%)
7(0.1%)
5(0.0%)
1(0.0%)
14(0.1%)
292(2.6%)
9(0.1%)
10286(90.6%)
11358 (41.6%) 15954 (58.4%)
10 STATISTICAL_MURDER_FLAG [factor]
1. false
2. true
22046(80.7%)
5266(19.3%)
27312 (100.0%) 0 (0.0%)
11 PERP_AGE_GROUP [factor]
1. <18
2. 1020
3. 18-24
4. 224
5. 25-44
6. 45-64
7. 65+
8. 940
9. UNKNOWN
1591(9.2%)
1(0.0%)
6222(35.9%)
1(0.0%)
5687(32.8%)
617(3.6%)
60(0.3%)
1(0.0%)
3148(18.2%)
17328 (63.4%) 9984 (36.6%)
12 PERP_SEX [factor]
1. F
2. M
3. U
424(2.4%)
15439(88.9%)
1499(8.6%)
17362 (63.6%) 9950 (36.4%)
13 PERP_RACE [factor]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
2(0.0%)
154(0.9%)
11432(65.8%)
1314(7.6%)
1836(10.6%)
283(1.6%)
2341(13.5%)
17362 (63.6%) 9950 (36.4%)
14 VIC_AGE_GROUP [factor]
1. <18
2. 1022
3. 18-24
4. 25-44
5. 45-64
6. 65+
7. UNKNOWN
2839(10.4%)
1(0.0%)
10086(36.9%)
12281(45.0%)
1863(6.8%)
181(0.7%)
61(0.2%)
27312 (100.0%) 0 (0.0%)
15 VIC_SEX [factor]
1. F
2. M
3. U
2615(9.6%)
24686(90.4%)
11(0.0%)
27312 (100.0%) 0 (0.0%)
16 VIC_RACE [factor]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
10(0.0%)
404(1.5%)
19439(71.2%)
2646(9.7%)
66(0.2%)
698(2.6%)
4049(14.8%)
27312 (100.0%) 0 (0.0%)
17 X_COORD_CD [numeric]
Mean (sd) : 1009449 (18377.8)
min ≤ med ≤ max:
914928.1 ≤ 1007731 ≤ 1066815
IQR (CV) : 16809.5 (0)
12088 distinct values 27312 (100.0%) 0 (0.0%)
18 Y_COORD_CD [numeric]
Mean (sd) : 208127.4 (31886.4)
min ≤ med ≤ max:
125756.7 ≤ 194486.6 ≤ 271127.7
IQR (CV) : 56684.1 (0.2)
12283 distinct values 27312 (100.0%) 0 (0.0%)
19 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
12609 distinct values 27302 (100.0%) 10 (0.0%)
20 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
12594 distinct values 27302 (100.0%) 10 (0.0%)
21 Lon_Lat [factor]
1. POINT (-73.70204616699993
2. POINT (-73.70308024299999
3. POINT (-73.70659811199994
4. POINT (-73.70686574799998
5. POINT (-73.71563825499999
6. POINT (-73.71803289299999
7. POINT (-73.72199603099995
8. POINT (-73.72653338999999
9. POINT (-73.72739037499997
10. POINT (-73.727478 40.6883
[ 12635 others ]
1(0.0%)
4(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
27288(99.9%)
27302 (100.0%) 10 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

3.1 Response

The response variable STATISTICAL_MURDER_FLAG is unbalanced: only 19.3% of the shooting incidents are also murders. For better visualization I decided to rename it along with its levels:

#rename murder
shootings$murder <- shootings$STATISTICAL_MURDER_FLAG
levels(shootings$murder) <- c("FALSE", "TRUE")

shootings$STATISTICAL_MURDER_FLAG <- NULL

Furthermore I create another variable in probability format:

#create response
shootings$murder_prob <- shootings$murder
levels(shootings$murder_prob) <- c(0,1)
shootings$murder_prob <- as.numeric(as.character(shootings$murder_prob))

3.2 Missing values

As specified in NYPD Shooting Incident Level Data Footnotes (pdf file in data folder), null values should be considered as either “Unknown/Not Available/Not Reported”. Thus I consider missing values as ‘Unknown’. For some predictors the ‘Unknown’ level is already specified thus I merge the two levels.

NA_as_unknown <- function(predictor, new_name="UNKNOWN"){
  lev=c(levels(predictor), new_name)
  return(factor( ifelse(is.na(predictor), new_name, predictor), labels = lev))
}

shootings$LOC_OF_OCCUR_DESC <- NA_as_unknown(shootings$LOC_OF_OCCUR_DESC)
shootings$JURISDICTION_CODE <- NA_as_unknown(shootings$JURISDICTION_CODE)
shootings$LOC_CLASSFCTN_DESC <- NA_as_unknown(shootings$LOC_CLASSFCTN_DESC)
shootings$LOCATION_DESC <- NA_as_unknown(shootings$LOCATION_DESC)
shootings$PERP_AGE_GROUP <- NA_as_unknown(shootings$PERP_AGE_GROUP)
shootings$PERP_SEX <- NA_as_unknown(shootings$PERP_SEX, "U")
shootings$PERP_RACE <- NA_as_unknown(shootings$PERP_RACE)

The “UNKNOWN” levels are used as reference for all the factors for better visualization. Those levels will be dropped in the following sections before using any model.

shootings$LOC_OF_OCCUR_DESC <- relevel(shootings$LOC_OF_OCCUR_DESC, "UNKNOWN")
shootings$JURISDICTION_CODE <- relevel(shootings$JURISDICTION_CODE, "UNKNOWN")
shootings$LOC_CLASSFCTN_DESC <- relevel(shootings$LOC_CLASSFCTN_DESC, "UNKNOWN")
shootings$LOCATION_DESC <- relevel(shootings$LOCATION_DESC, "UNKNOWN")
shootings$PERP_AGE_GROUP <- relevel(shootings$PERP_AGE_GROUP, "UNKNOWN")
shootings$PERP_SEX <- relevel(shootings$PERP_SEX, "U")
shootings$PERP_RACE <- relevel(shootings$PERP_RACE, "UNKNOWN")
shootings$VIC_AGE_GROUP <- relevel(shootings$VIC_AGE_GROUP, "UNKNOWN")
shootings$VIC_SEX <- relevel(shootings$VIC_SEX, "U")
shootings$VIC_RACE <- relevel(shootings$VIC_RACE, "UNKNOWN")

3.3 Temporal data

In this data set are present 2 variable which gives information on the time and date of the incident, respectively: OCCUR_TIME and OCCUR_DATE.

print(dfSummary(shootings[c("OCCUR_TIME","OCCUR_DATE")]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 OCCUR_TIME [factor]
1. 00:00:00
2. 00:01:00
3. 00:02:00
4. 00:03:00
5. 00:04:00
6. 00:05:00
7. 00:06:00
8. 00:07:00
9. 00:08:00
10. 00:09:00
[ 1411 others ]
6(0.0%)
67(0.2%)
39(0.1%)
28(0.1%)
40(0.1%)
69(0.3%)
23(0.1%)
24(0.1%)
25(0.1%)
32(0.1%)
26959(98.7%)
27312 (100.0%) 0 (0.0%)
2 OCCUR_DATE [factor]
1. 01/01/2006
2. 01/01/2007
3. 01/01/2008
4. 01/01/2009
5. 01/01/2010
6. 01/01/2011
7. 01/01/2012
8. 01/01/2013
9. 01/01/2014
10. 01/01/2015
[ 5751 others ]
8(0.0%)
18(0.1%)
19(0.1%)
7(0.0%)
8(0.0%)
12(0.0%)
11(0.0%)
11(0.0%)
12(0.0%)
3(0.0%)
27203(99.6%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Both variables have a lot of levels. It is reasonable to aggregate some levels of these variables or extract useful information from them.

3.3.1 Time

For variable OCCUR_TIME i decided to create a variable describing the period of the day:

  • early morning: from 05:00 to 8:59

  • morning: from 09:00 to 12:59

  • early afternoon: from 13:00 to 15:59

  • afternoon: from 16:00 to 19:59

  • evening: from 20:00 to 22:59

  • night: from 23:00 to 04:00

#convert to character
shootings$hour <- as.character(shootings$OCCUR_TIME)

#split by ":"
shootings$hour <- strsplit(shootings$hour, ":")

shootings$hour <- rapply(shootings$hour, function(x){ x[1] })

shootings$hour <- as.factor(shootings$hour)
shootings$day_period <- rapply( as.list(as.numeric(as.character(shootings$hour))), function(x){

  y <- "Night"
    
  if (x>=5 & x<=8){
    y <- "EarlyMorning"
  }else if(x>=9 & x<=12){
    y <- "Morning"
  }else if(x>=13 & x<=15){
    y <- "EarlyAfternoon"
  }else if(x>=16 & x<=19){
    y <- "Afternoon"
  }else if(x>=20 & x<=22){
    y <- "Evening"
  }
  
  y
})

shootings$day_period <- as.factor(shootings$day_period)

shootings$day_period <- factor(shootings$day_period, levels=c('EarlyMorning', 'Morning', 'EarlyAfternoon', 'Afternoon', 'Evening', 'Night'))

Could be also useful to have a variable describing whenever the hour is a work hour or not:

shootings$working_hour <- rapply( as.list(as.numeric(as.character(shootings$hour))), function(x){

  y <- "FALSE"
    
  if (x>=9 & x<=17){
    y <- "TRUE"
  }
  
  y
})

shootings$working_hour <- as.factor(shootings$working_hour)
shootings$hour <- NULL

3.3.2 Date

For variable OCCUR_DATE i decided to extract useful information:

  • the year

  • the day of the year

  • the day of the week (categorical)

  • whatever the day is a working day (Monday to Friday) or not (Saturday and Sunday)

#convert to character
dates <- as.character(shootings$OCCUR_DATE)

#convert to date
dates.date <- as.Date(dates, format = "%m/%d/%Y") 

#create weekday variable
shootings$week_day <- as.factor(weekdays(dates.date))
shootings$week_day <- factor(shootings$week_day, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))

#create working_day variable
shootings$working_day <- rapply( as.list(as.character(shootings$week_day)), function(x){
  
  y <- "TRUE"
    
  if (x=="Sunday" | x=="Saturday"){
    y <- "FALSE"
  }
  
  y
})

shootings$working_day <- as.factor(shootings$working_day)

shootings$year <- year(dates.date)
shootings$day_year <- yday(dates.date)

Could be also useful to have information specific on the COVID pandemic. Thus I created a two variable describing:

  • whatever the day occurred during COVID lock-down in New York period or not. The period I considered was from 2020-03-22 to 2020-9-30 (I considered both the “Stay-at-home order” and the entire “Four-phase reopening plan”, for addiction information see: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_New_York_City ).

  • whatever the day occurred during COVID pandemic or not. The period I considered was from 2020-03-11 (WHO declares the pandemic) to 2022-12-31 (most recent date in data set). For additional information see: https://en.wikipedia.org/wiki/COVID-19_pandemic

shootings$COVID_lockdown <- rapply( as.list(dates.date), function(x){

  y <- "FALSE"
  
  if (as.Date("2020-03-22")<=x & x<=as.Date("2020-9-30")){
    y <- "TRUE"
  }
  
  y
})

shootings$COVID_pandemic <- rapply( as.list(dates.date), function(x){

  y <- "FALSE"
  
  if (as.Date("2020-03-11")<=x & x<=as.Date("2022-12-31")){
    y <- "TRUE"
  }
  
  y
})

shootings$COVID_lockdown <- as.factor(shootings$COVID_lockdown)
shootings$COVID_pandemic <- as.factor(shootings$COVID_pandemic)
dates.date<-NULL
dates<-NULL

3.3.3 Working Day and Working Hour

Could be interesting to create a unique variable which gives information about working day and working hour. This new variable will substitute the old “working hour”.

shootings$working <-  as.logical(shootings$working_hour) & as.logical(shootings$working_day)

shootings$working_hour <- as.factor(shootings$working)

shootings$working_day <- NULL
shootings$working <- NULL

3.4 Location data

In this data set are present 3 types of data which gives information on the information of the incident:

  • Geographical data:
    • X_COORD_CD: Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104).
    • Y_COORD_CD: Midblock Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104).
    • Latitude: Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).
    • Longitude: Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).
  • Location description data:
    • LOC_OF_OCCUR_DESC: whatever the incident occurred inside or outside.
    • LOC_CLASSFCTN_DESC: description of the incident location within categories.
    • LOCATION_DESC: general description of incident location.
  • City location data:
    • BORO: borough where the shooting incident occurred.
    • PRECINCT: precinct where the shooting incident occurred.

3.4.1 Geographical data

Since pairs of variables (X_COORD_CD, Y_COORD_CD) and (Latitude, Longitude) represent the same information in different scales, I decided to use (Latitude, Longitude) because it is the most known format. Since there are only 10 rows with missing Latitude and Longitude I decided to covert them manually using: https://epsg.io/transform#s_srs=2263&t_srs=4326&x=988902.0000000&y=192641.0000000. I also removed the predictor ‘Lon_Lat’, which is redundant.

na_rows <- which(rowSums(is.na(shootings)) > 0)

shootings[na_rows,c("X_COORD_CD","Y_COORD_CD", "Latitude", "Longitude")]
##       X_COORD_CD Y_COORD_CD Latitude Longitude
## 1407      998002     196692       NA        NA
## 25598     990784     149362       NA        NA
## 25599    1002173     249401       NA        NA
## 25833    1019164     210169       NA        NA
## 25939     995122     155693       NA        NA
## 26274     997407     233806       NA        NA
## 26742     997407     233806       NA        NA
## 26815    1001891     245600       NA        NA
## 26876    1041717     197008       NA        NA
## 27206     988902     192641       NA        NA
insert_LatLong<-function(index, Longitude, Latitude){
  
  shootings[index,]$Latitude <- Latitude
  shootings[index,]$Longitude <- Longitude
  
  assign('shootings',shootings,envir=.GlobalEnv)
}

insert_LatLong(1407, -73.9503992, 40.7065397)
insert_LatLong(25598, -73.9764791, 40.5766375)
insert_LatLong(25599, -73.9352145, 40.8512045)
insert_LatLong(25833, -73.8740021, 40.7434723)
insert_LatLong(25939, -73.960853, 40.5940105)
insert_LatLong(26274, -73.9524724, 40.808409)
insert_LatLong(26742, -73.9524724, 40.808409)
insert_LatLong(26815, -73.9362438, 40.8407724)
insert_LatLong(26876, -73.7927256, 40.7072308)
insert_LatLong(27206, -73.9832238, 40.6954301)
shootings$Lon_Lat <- NULL
shootings$X_COORD_CD <- NULL
shootings$Y_COORD_CD <- NULL

na_rows <- NULL

3.4.2 Location description data

print(dfSummary(shootings[,c('LOC_OF_OCCUR_DESC', 'LOC_CLASSFCTN_DESC', 'LOCATION_DESC')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 LOC_OF_OCCUR_DESC [factor]
1. UNKNOWN
2. INSIDE
3. OUTSIDE
25596(93.7%)
242(0.9%)
1474(5.4%)
27312 (100.0%) 0 (0.0%)
2 LOC_CLASSFCTN_DESC [factor]
1. UNKNOWN
2. COMMERCIAL
3. DWELLING
4. HOUSING
5. OTHER
6. PARKING LOT
7. PLAYGROUND
8. STREET
9. TRANSIT
10. VEHICLE
25596(93.7%)
100(0.4%)
127(0.5%)
280(1.0%)
31(0.1%)
7(0.0%)
30(0.1%)
1103(4.0%)
15(0.1%)
23(0.1%)
27312 (100.0%) 0 (0.0%)
3 LOCATION_DESC [factor]
1. UNKNOWN
2. ATM
3. BANK
4. BAR/NIGHT CLUB
5. BEAUTY/NAIL SALON
6. CANDY STORE
7. CHAIN STORE
8. CHECK CASH
9. CLOTHING BOUTIQUE
10. COMMERCIAL BLDG
[ 30 others ]
15954(58.4%)
1(0.0%)
9(0.0%)
1(0.0%)
14(0.1%)
31(0.1%)
8(0.0%)
104(0.4%)
71(0.3%)
694(2.5%)
10425(38.2%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Both LOC_OF_OCCUR_DESC and LOC_CLASSFCTN_DESC are unknown in 93.7% of the times. While LOCATION_DESC only in 58.4%.

Lets explore ‘LOCATION_DESC’ levels:

levels(shootings$LOCATION_DESC)
##  [1] "UNKNOWN"                   "ATM"                      
##  [3] "BANK"                      "BAR/NIGHT CLUB"           
##  [5] "BEAUTY/NAIL SALON"         "CANDY STORE"              
##  [7] "CHAIN STORE"               "CHECK CASH"               
##  [9] "CLOTHING BOUTIQUE"         "COMMERCIAL BLDG"          
## [11] "DEPT STORE"                "DOCTOR/DENTIST"           
## [13] "DRUG STORE"                "DRY CLEANER/LAUNDRY"      
## [15] "FACTORY/WAREHOUSE"         "FAST FOOD"                
## [17] "GAS STATION"               "GROCERY/BODEGA"           
## [19] "GYM/FITNESS FACILITY"      "HOSPITAL"                 
## [21] "HOTEL/MOTEL"               "JEWELRY STORE"            
## [23] "LIQUOR STORE"              "LOAN COMPANY"             
## [25] "MULTI DWELL - APT BUILD"   "MULTI DWELL - PUBLIC HOUS"
## [27] "NONE"                      "PHOTO/COPY STORE"         
## [29] "PVT HOUSE"                 "RESTAURANT/DINER"         
## [31] "SCHOOL"                    "SHOE STORE"               
## [33] "SMALL MERCHANT"            "SOCIAL CLUB/POLICY LOCATI"
## [35] "STORAGE FACILITY"          "STORE UNCLASSIFIED"       
## [37] "SUPERMARKET"               "TELECOMM. STORE"          
## [39] "VARIETY STORE"             "VIDEO STORE"

I should consider level ‘NONE’ as ‘UNKNOWN’:

shootings[shootings$LOCATION_DESC=="NONE", "LOCATION_DESC"] <- "UNKNOWN"

shootings$LOCATION_DESC <- droplevels(shootings$LOCATION_DESC)

levels(shootings$LOCATION_DESC)
##  [1] "UNKNOWN"                   "ATM"                      
##  [3] "BANK"                      "BAR/NIGHT CLUB"           
##  [5] "BEAUTY/NAIL SALON"         "CANDY STORE"              
##  [7] "CHAIN STORE"               "CHECK CASH"               
##  [9] "CLOTHING BOUTIQUE"         "COMMERCIAL BLDG"          
## [11] "DEPT STORE"                "DOCTOR/DENTIST"           
## [13] "DRUG STORE"                "DRY CLEANER/LAUNDRY"      
## [15] "FACTORY/WAREHOUSE"         "FAST FOOD"                
## [17] "GAS STATION"               "GROCERY/BODEGA"           
## [19] "GYM/FITNESS FACILITY"      "HOSPITAL"                 
## [21] "HOTEL/MOTEL"               "JEWELRY STORE"            
## [23] "LIQUOR STORE"              "LOAN COMPANY"             
## [25] "MULTI DWELL - APT BUILD"   "MULTI DWELL - PUBLIC HOUS"
## [27] "PHOTO/COPY STORE"          "PVT HOUSE"                
## [29] "RESTAURANT/DINER"          "SCHOOL"                   
## [31] "SHOE STORE"                "SMALL MERCHANT"           
## [33] "SOCIAL CLUB/POLICY LOCATI" "STORAGE FACILITY"         
## [35] "STORE UNCLASSIFIED"        "SUPERMARKET"              
## [37] "TELECOMM. STORE"           "VARIETY STORE"            
## [39] "VIDEO STORE"

Variables LOC_OF_OCCUR_DESC and LOC_CLASSFCTN_DESC are always both known or both unknown:

shootings_locations <- shootings[,c('LOC_OF_OCCUR_DESC', 'LOC_CLASSFCTN_DESC', 'LOCATION_DESC')]
shootings_locations[(shootings$LOC_OF_OCCUR_DESC =='UNKNOWN' & shootings$LOC_CLASSFCTN_DESC != 'UNKNOWN') | (shootings$LOC_OF_OCCUR_DESC !='UNKNOWN' & shootings$LOC_CLASSFCTN_DESC == 'UNKNOWN'), ]
## [1] LOC_OF_OCCUR_DESC  LOC_CLASSFCTN_DESC LOCATION_DESC     
## <0 rows> (or 0-length row.names)

Maybe they can give additional information on LOCATION_DESC when its value is unknown:

print(dfSummary(shootings_locations[shootings$LOC_OF_OCCUR_DESC !='UNKNOWN' & shootings$LOC_CLASSFCTN_DESC != 'UNKNOWN' & shootings$LOCATION_DESC == 'UNKNOWN', ]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 LOC_OF_OCCUR_DESC [factor]
1. UNKNOWN
2. INSIDE
3. OUTSIDE
0(0.0%)
40(4.0%)
949(96.0%)
989 (100.0%) 0 (0.0%)
2 LOC_CLASSFCTN_DESC [factor]
1. UNKNOWN
2. COMMERCIAL
3. DWELLING
4. HOUSING
5. OTHER
6. PARKING LOT
7. PLAYGROUND
8. STREET
9. TRANSIT
10. VEHICLE
0(0.0%)
24(2.4%)
8(0.8%)
2(0.2%)
12(1.2%)
1(0.1%)
30(3.0%)
874(88.4%)
15(1.5%)
23(2.3%)
989 (100.0%) 0 (0.0%)
3 LOCATION_DESC [factor]
1. UNKNOWN
2. ATM
3. BANK
4. BAR/NIGHT CLUB
5. BEAUTY/NAIL SALON
6. CANDY STORE
7. CHAIN STORE
8. CHECK CASH
9. CLOTHING BOUTIQUE
10. COMMERCIAL BLDG
[ 29 others ]
989(100.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
0(0.0%)
989 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

When LOCATION_DESC is unknown I used LOC_CLASSFCTN_DESC and LOC_OF_OCCUR_DESC information to fill UNKNOWN level of LOCATION_DESC, in particular I take the mode of LOCATION_DESC when LOC_CLASSFCTN_DESC and LOC_OF_OCCUR_DESC take a specific value and use this value in place of UNKNOWN for LOCATION_DESC. If this kind of procedure is not possible (LOCATION_DESC is only unknown for a specific pair of values) the value of LOCATION_DESC remains untouched.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode_O_I <- function(x){
  mode_o <- Mode(shootings_locations[shootings_locations$LOC_OF_OCCUR_DESC == 'OUTSIDE' & shootings_locations$LOC_CLASSFCTN_DESC == x & shootings_locations$LOCATION_DESC != 'UNKNOWN', ]$LOCATION_DESC)

mode_i <-Mode(shootings_locations[shootings_locations$LOC_OF_OCCUR_DESC == 'INSIDE' & shootings_locations$LOC_CLASSFCTN_DESC == x & shootings_locations$LOCATION_DESC != 'UNKNOWN', ]$LOCATION_DESC)

return(c(mode_o, mode_i))
}

modify_UNKNOWN_loc <-function(level_name_LOC_OF_OCCUR_DESC, level_name_LOC_CLASSFCTN_DESC,  new_value){
  
  rows <- dim(shootings[shootings$LOC_OF_OCCUR_DESC == level_name_LOC_OF_OCCUR_DESC & shootings$LOC_CLASSFCTN_DESC == level_name_LOC_CLASSFCTN_DESC & shootings$LOCATION_DESC == 'UNKNOWN',])[1]
  
  if (!is.na(new_value)) {
    shootings[shootings$LOC_OF_OCCUR_DESC == level_name_LOC_OF_OCCUR_DESC & shootings$LOC_CLASSFCTN_DESC == level_name_LOC_CLASSFCTN_DESC & shootings$LOCATION_DESC == 'UNKNOWN', "LOCATION_DESC"]<-new_value
    
    print(paste("modified rows:",rows))
  }
  else{
    print(paste("untouched rows:", rows))
  }
  assign('shootings',shootings,envir=.GlobalEnv)
}

infer_LOCATION_DESC <- function(level_name_LOC_CLASSFCTN_DESC){
  modes <- Mode_O_I(level_name)
  
  print(modes)
  
  modify_UNKNOWN_loc("OUTSIDE", level_name_LOC_CLASSFCTN_DESC, modes[1])
  
  modify_UNKNOWN_loc("INSIDE", level_name_LOC_CLASSFCTN_DESC, modes[2])
  
}
  1. ‘COMMERCIAL’ level
level_name <- "COMMERCIAL"
infer_LOCATION_DESC(level_name)
## [1] COMMERCIAL BLDG COMMERCIAL BLDG
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 15"
## [1] "modified rows: 9"
  1. ‘DWELLING’ level
level_name <- "DWELLING"
infer_LOCATION_DESC(level_name)
## [1] GROCERY/BODEGA GROCERY/BODEGA
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 2"
## [1] "modified rows: 6"
  1. ‘HOUSING’ level
level_name <- "HOUSING"
infer_LOCATION_DESC(level_name)
## [1] GYM/FITNESS FACILITY GYM/FITNESS FACILITY
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 1"
## [1] "modified rows: 1"
  1. ‘OTHER’ level
level_name <- "OTHER"
infer_LOCATION_DESC(level_name)
## [1] DOCTOR/DENTIST DOCTOR/DENTIST
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 12"
## [1] "modified rows: 0"
  1. ‘PARKING LOT’ level
level_name <- "PARKING LOT"
infer_LOCATION_DESC(level_name)
## [1] VIDEO STORE <NA>       
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 1"
## [1] "untouched rows: 0"
  1. ‘PLAYGROUND’ level
level_name <- "PLAYGROUND"
infer_LOCATION_DESC(level_name)
## [1] <NA> <NA>
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "untouched rows: 28"
## [1] "untouched rows: 2"
  1. ‘STREET’ level
level_name <- "STREET"
infer_LOCATION_DESC(level_name)
## [1] GROCERY/BODEGA GROCERY/BODEGA
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "modified rows: 864"
## [1] "modified rows: 10"
  1. ‘TRANSIT’ level
level_name <- "TRANSIT"
infer_LOCATION_DESC(level_name)
## [1] <NA> <NA>
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "untouched rows: 5"
## [1] "untouched rows: 10"
  1. ‘VEHICLE’ level
level_name <- "VEHICLE"
infer_LOCATION_DESC(level_name)
## [1] <NA> <NA>
## 39 Levels: UNKNOWN ATM BANK BAR/NIGHT CLUB BEAUTY/NAIL SALON ... VIDEO STORE
## [1] "untouched rows: 21"
## [1] "untouched rows: 2"

Then I simply remove LOC_CLASSFCTN_DESC and LOC_OF_OCCUR_DESC:

shootings$LOC_CLASSFCTN_DESC <- NULL
shootings$LOC_OF_OCCUR_DESC <- NULL
shootings_locations<-NULL

Now lets continue exploring LOCATION_DESC variable:

print(dfSummary(shootings$LOCATION_DESC,max.distinct.values = 50), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 LOCATION_DESC [factor]
1. UNKNOWN
2. ATM
3. BANK
4. BAR/NIGHT CLUB
5. BEAUTY/NAIL SALON
6. CANDY STORE
7. CHAIN STORE
8. CHECK CASH
9. CLOTHING BOUTIQUE
10. COMMERCIAL BLDG
11. DEPT STORE
12. DOCTOR/DENTIST
13. DRUG STORE
14. DRY CLEANER/LAUNDRY
15. FACTORY/WAREHOUSE
16. FAST FOOD
17. GAS STATION
18. GROCERY/BODEGA
19. GYM/FITNESS FACILITY
20. HOSPITAL
21. HOTEL/MOTEL
22. JEWELRY STORE
23. LIQUOR STORE
24. LOAN COMPANY
25. MULTI DWELL - APT BUILD
26. MULTI DWELL - PUBLIC HOUS
27. PHOTO/COPY STORE
28. PVT HOUSE
29. RESTAURANT/DINER
30. SCHOOL
31. SHOE STORE
32. SMALL MERCHANT
33. SOCIAL CLUB/POLICY LOCATI
34. STORAGE FACILITY
35. STORE UNCLASSIFIED
36. SUPERMARKET
37. TELECOMM. STORE
38. VARIETY STORE
39. VIDEO STORE
15070(55.2%)
1(0.0%)
9(0.0%)
1(0.0%)
14(0.1%)
31(0.1%)
8(0.0%)
104(0.4%)
71(0.3%)
718(2.6%)
3(0.0%)
77(0.3%)
3(0.0%)
35(0.1%)
12(0.0%)
41(0.2%)
1(0.0%)
3717(13.6%)
4834(17.7%)
175(0.6%)
1(0.0%)
951(3.5%)
204(0.7%)
628(2.3%)
1(0.0%)
10(0.0%)
72(0.3%)
1(0.0%)
36(0.1%)
21(0.1%)
11(0.0%)
11(0.0%)
8(0.0%)
112(0.4%)
7(0.0%)
5(0.0%)
1(0.0%)
14(0.1%)
293(1.1%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

This variable has a lot of levels with low frequencies; it is reasonable to aggregate them. I decided to aggregate levels whose frequencies are below or equal to 0.8% of the rows.

aggregate_levels_factor<-function(factor_to_aggregate, new_level_name, perc){
  
  levels(shootings[, factor_to_aggregate]) <- c(levels(shootings[, factor_to_aggregate]), new_level_name) #create "OTHER" level
  
  levels_table <- table(shootings[ ,factor_to_aggregate])
  
  for (i in 1:dim(levels_table)) {
    
    if (levels_table[i]*100/dim(shootings)[1] <= perc){
      shootings[shootings[, factor_to_aggregate]==rownames(levels_table)[i], factor_to_aggregate] <- new_level_name
    }
  }
  
  shootings[,factor_to_aggregate] <- droplevels(shootings[ , factor_to_aggregate])
  
  print(levels(shootings[,factor_to_aggregate]))
  
  assign('shootings',shootings,envir=.GlobalEnv)
}

aggregate_levels_factor("LOCATION_DESC", "OTHER", 0.8)
## [1] "UNKNOWN"              "COMMERCIAL BLDG"      "GROCERY/BODEGA"      
## [4] "GYM/FITNESS FACILITY" "JEWELRY STORE"        "LOAN COMPANY"        
## [7] "VIDEO STORE"          "OTHER"
shootings$location_desc <- shootings$LOCATION_DESC
shootings$LOCATION_DESC <- NULL

print(dfSummary(shootings$location_desc,max.distinct.values = 50), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 location_desc [factor]
1. UNKNOWN
2. COMMERCIAL BLDG
3. GROCERY/BODEGA
4. GYM/FITNESS FACILITY
5. JEWELRY STORE
6. LOAN COMPANY
7. VIDEO STORE
8. OTHER
15070(55.2%)
718(2.6%)
3717(13.6%)
4834(17.7%)
951(3.5%)
628(2.3%)
293(1.1%)
1101(4.0%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

As we can see, we have still 55.2% of the location description data as UNKNOWN.

3.4.3 City location data

The city of New York has 77 police precincts, as shown in picture below.

NYC boroughs and precincts
NYC boroughs and precincts

Thus variables PRECINT has a lot of levels with low frequencies, while variable BORO has only 5 levels.

print(dfSummary(shootings[,c("PRECINCT", "BORO")], max.distinct.values = 77), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 PRECINCT [factor]
1. 1
2. 5
3. 6
4. 7
5. 9
6. 10
7. 13
8. 14
9. 17
10. 18
11. 19
12. 20
13. 22
14. 23
15. 24
16. 25
17. 26
18. 28
19. 30
20. 32
21. 33
22. 34
23. 40
24. 41
25. 42
26. 43
27. 44
28. 45
29. 46
30. 47
31. 48
32. 49
33. 50
34. 52
35. 60
36. 61
37. 62
38. 63
39. 66
40. 67
41. 68
42. 69
43. 70
44. 71
45. 72
46. 73
47. 75
48. 76
49. 77
50. 78
51. 79
52. 81
53. 83
54. 84
55. 88
56. 90
57. 94
58. 100
59. 101
60. 102
61. 103
62. 104
63. 105
64. 106
65. 107
66. 108
67. 109
68. 110
69. 111
70. 112
71. 113
72. 114
73. 115
74. 120
75. 121
76. 122
77. 123
25(0.1%)
58(0.2%)
28(0.1%)
109(0.4%)
109(0.4%)
73(0.3%)
60(0.2%)
56(0.2%)
10(0.0%)
34(0.1%)
20(0.1%)
40(0.1%)
1(0.0%)
487(1.8%)
105(0.4%)
461(1.7%)
149(0.5%)
343(1.3%)
229(0.8%)
634(2.3%)
225(0.8%)
316(1.2%)
908(3.3%)
494(1.8%)
850(3.1%)
758(2.8%)
1020(3.7%)
182(0.7%)
895(3.3%)
953(3.5%)
787(2.9%)
353(1.3%)
154(0.6%)
583(2.1%)
372(1.4%)
153(0.6%)
70(0.3%)
282(1.0%)
46(0.2%)
1216(4.5%)
32(0.1%)
466(1.7%)
459(1.7%)
579(2.1%)
109(0.4%)
1452(5.3%)
1557(5.7%)
167(0.6%)
795(2.9%)
62(0.2%)
1012(3.7%)
799(2.9%)
500(1.8%)
124(0.5%)
280(1.0%)
315(1.2%)
86(0.3%)
170(0.6%)
489(1.8%)
210(0.8%)
593(2.2%)
102(0.4%)
479(1.8%)
224(0.8%)
101(0.4%)
67(0.2%)
115(0.4%)
160(0.6%)
11(0.0%)
23(0.1%)
802(2.9%)
369(1.4%)
179(0.7%)
572(2.1%)
112(0.4%)
61(0.2%)
31(0.1%)
27312 (100.0%) 0 (0.0%)
2 BORO [factor]
1. BRONX
2. BROOKLYN
3. MANHATTAN
4. QUEENS
5. STATEN ISLAND
7937(29.1%)
10933(40.0%)
3572(13.1%)
4094(15.0%)
776(2.8%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Given that the PRECINCT variable gives the same information as BORO (with a finest granularity), it is reasonable to aggregate the two factors, creating a variable giving more information compared with BORO with less levels compared to PRECINCT. The following images gives a visual illustration of the aggregation I applied.

NYC boroughs and precincts aggregated
NYC boroughs and precincts aggregated
generate_city_location <- function(){
  South_Manhattan <- c(1,5,6,7,9,10,13,14,17,18,19,22)
  North_Manhattan <- c(20, 23, 24, 25, 26, 28, 30, 32, 33, 34)
  
  West_Bronx <- c(40,41,42,44,46,48,50,52)
  East_Bronx <- c(43, 45, 47, 49)
  
  North_East_Queens <- c(103, 107, 109, 111)
  North_West_Queens <- c(104, 108, 110, 112, 114, 115)
  South_Queens <- c(100, 101, 102, 105, 106, 113)
  
  North_Brooklyn <- c(71, 76, 77, 78, 79, 81, 83, 84, 88, 90, 94)
  South_East_Brooklyn <- c(63, 67, 69, 73, 75)
  South_West_Brooklyn <- c(60, 61, 62, 66, 68, 70, 72)
  
  West_Staten_Island <- c(121, 123)
  East_Staten_Island <- c(120, 122)
  
  city_location <- rapply( as.list(as.numeric(as.character(shootings$PRECINCT))), function(x){
     
    y <- "Error"
    
    if (x %in% South_Manhattan){
      y <- "S_Manhattan"
    }
    
    if (x %in% North_Manhattan){
      y <- "N_Manhattan"
    }
    
    if (x %in% West_Bronx){
      y <- "W_Bronx"
    }
    
    if (x %in% East_Bronx){
      y <- "E_Bronx"
    }
    
    if (x %in% North_East_Queens){
      y <- "N_E_Queens"
    }
    
    if (x %in% North_West_Queens){
      y <- "N_W_Queens"
    }
    
    if (x %in% South_Queens){
      y <- "S_Queens"
    }
    
    if (x %in% North_Brooklyn){
      y <- "N_Brooklyn"
    }
    
    if (x %in% South_East_Brooklyn){
      y <- "S_E_Brooklyn"
    }
    
    if (x %in% South_West_Brooklyn){
      y <- "S_W_Brooklyn"
    }
    
    if (x %in% West_Staten_Island){
      y <- "W_Staten_Island"
    }
    
    if (x %in% East_Staten_Island){
      y <- "E_Staten_Island"
    }
    
    y
  })
}


shootings$city_location <- generate_city_location()

shootings$city_location <- as.factor(shootings$city_location)

shootings$city_location <- factor(shootings$city_location, levels=c('S_Manhattan', 'N_Manhattan', 'W_Bronx', 'E_Bronx', 'N_E_Queens', 'N_W_Queens', 'S_Queens', 'N_Brooklyn', 'S_E_Brooklyn', 'S_W_Brooklyn','W_Staten_Island', 'E_Staten_Island'))
shootings$BORO <- NULL
shootings$PRECINCT <- NULL
print(dfSummary(shootings[,"city_location"], max.distinct.values = 20), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
11. W_Staten_Island
12. E_Staten_Island
583(2.1%)
2989(10.9%)
5691(20.8%)
2246(8.2%)
820(3.0%)
900(3.3%)
2374(8.7%)
4719(17.3%)
4973(18.2%)
1241(4.5%)
143(0.5%)
633(2.3%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

3.5 Age data

The predictors PERP_AGE_GROUP and VIC_AGE_GROUP present some strange levels:

print(dfSummary(shootings[,c('PERP_AGE_GROUP', 'VIC_AGE_GROUP')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 PERP_AGE_GROUP [factor]
1. UNKNOWN
2. <18
3. 1020
4. 18-24
5. 224
6. 25-44
7. 45-64
8. 65+
9. 940
13132(48.1%)
1591(5.8%)
1(0.0%)
6222(22.8%)
1(0.0%)
5687(20.8%)
617(2.3%)
60(0.2%)
1(0.0%)
27312 (100.0%) 0 (0.0%)
2 VIC_AGE_GROUP [factor]
1. UNKNOWN
2. <18
3. 1022
4. 18-24
5. 25-44
6. 45-64
7. 65+
61(0.2%)
2839(10.4%)
1(0.0%)
10086(36.9%)
12281(45.0%)
1863(6.8%)
181(0.7%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

These levels contains all contain only one observation. It is reasonable to aggregate them with the “UNKNOWN” level. Furthermore, I aggregate the levels “45-64” and “65+, creating a new level.

levels(shootings$PERP_AGE_GROUP)
## [1] "UNKNOWN" "<18"     "1020"    "18-24"   "224"     "25-44"   "45-64"  
## [8] "65+"     "940"
levels(shootings$VIC_AGE_GROUP)
## [1] "UNKNOWN" "<18"     "1022"    "18-24"   "25-44"   "45-64"   "65+"
shootings$perp_age <- shootings$PERP_AGE_GROUP
levels(shootings$perp_age) <- c("UNKNOWN", "<18", "UNKNOWN", "18-24" , "UNKNOWN", "25-44", "45+", "45+",  "UNKNOWN")
shootings$PERP_AGE_GROUP <- NULL


shootings$vic_age <- shootings$VIC_AGE_GROUP
levels(shootings$vic_age) <- c("UNKNOWN", "<18", "UNKNOWN", "18-24", "25-44", "45+", "45+")
shootings$VIC_AGE_GROUP <- NULL
print(dfSummary(shootings[,c('perp_age', 'vic_age')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 perp_age [factor]
1. UNKNOWN
2. <18
3. 18-24
4. 25-44
5. 45+
13135(48.1%)
1591(5.8%)
6222(22.8%)
5687(20.8%)
677(2.5%)
27312 (100.0%) 0 (0.0%)
2 vic_age [factor]
1. UNKNOWN
2. <18
3. 18-24
4. 25-44
5. 45+
62(0.2%)
2839(10.4%)
10086(36.9%)
12281(45.0%)
2044(7.5%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

As we can see 48.1% of age data regarding the perpetrator is UNKNOWN, while only 0.2% of the data are UNKNOWN for the victim.

3.6 Sex data

print(dfSummary(shootings[,c('PERP_SEX', 'VIC_SEX')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 PERP_SEX [factor]
1. U
2. F
3. M
11449(41.9%)
424(1.6%)
15439(56.5%)
27312 (100.0%) 0 (0.0%)
2 VIC_SEX [factor]
1. U
2. F
3. M
11(0.0%)
2615(9.6%)
24686(90.4%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Let’s rename these variables.

shootings$perp_sex <- shootings$PERP_SEX
shootings$vic_sex <- shootings$VIC_SEX

shootings$PERP_SEX <- NULL
shootings$VIC_SEX <- NULL

As we can see 41.9% of age data regarding the perpetrator is UNKNOWN, while only 11 of the data are UNKNOWN for the victim.

3.7 Race data

print(dfSummary(shootings[,c('PERP_RACE', 'VIC_RACE')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 PERP_RACE [factor]
1. UNKNOWN
2. AMERICAN INDIAN/ALASKAN N
3. ASIAN / PACIFIC ISLANDER
4. BLACK
5. BLACK HISPANIC
6. WHITE
7. WHITE HISPANIC
11786(43.2%)
2(0.0%)
154(0.6%)
11432(41.9%)
1314(4.8%)
283(1.0%)
2341(8.6%)
27312 (100.0%) 0 (0.0%)
2 VIC_RACE [factor]
1. UNKNOWN
2. AMERICAN INDIAN/ALASKAN N
3. ASIAN / PACIFIC ISLANDER
4. BLACK
5. BLACK HISPANIC
6. WHITE
7. WHITE HISPANIC
66(0.2%)
10(0.0%)
404(1.5%)
19439(71.2%)
2646(9.7%)
698(2.6%)
4049(14.8%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

The level “AMERICAN INDIAN/ALASKAN N” has a very low frequency in the two variable, it is reasonable to aggregate it with level “UNKNOWN”. Furthermore I aggregate the levels “ASIAN / PACIFIC ISLANDER” and “WHITE” into level “ASIAN/WHITE”

levels(shootings$PERP_RACE)
## [1] "UNKNOWN"                        "AMERICAN INDIAN/ALASKAN NATIVE"
## [3] "ASIAN / PACIFIC ISLANDER"       "BLACK"                         
## [5] "BLACK HISPANIC"                 "WHITE"                         
## [7] "WHITE HISPANIC"
levels(shootings$VIC_RACE)
## [1] "UNKNOWN"                        "AMERICAN INDIAN/ALASKAN NATIVE"
## [3] "ASIAN / PACIFIC ISLANDER"       "BLACK"                         
## [5] "BLACK HISPANIC"                 "WHITE"                         
## [7] "WHITE HISPANIC"
shootings$perp_race <- shootings$PERP_RACE
levels(shootings$perp_race) <- c("UNKNOWN", "UNKNOWN", "ASIAN/WHITE", "BLACK", "BLACK HISPANIC", "ASIAN/WHITE", "WHITE HISPANIC")
shootings$PERP_RACE <- NULL

shootings$vic_race <- shootings$VIC_RACE
levels(shootings$vic_race) <- c("UNKNOWN", "UNKNOWN", "ASIAN/WHITE", "BLACK", "BLACK HISPANIC", "ASIAN/WHITE", "WHITE HISPANIC")
shootings$VIC_RACE <- NULL
print(dfSummary(shootings[,c('perp_race', 'vic_race')]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 perp_race [factor]
1. UNKNOWN
2. ASIAN/WHITE
3. BLACK
4. BLACK HISPANIC
5. WHITE HISPANIC
11788(43.2%)
437(1.6%)
11432(41.9%)
1314(4.8%)
2341(8.6%)
27312 (100.0%) 0 (0.0%)
2 vic_race [factor]
1. UNKNOWN
2. ASIAN/WHITE
3. BLACK
4. BLACK HISPANIC
5. WHITE HISPANIC
76(0.3%)
1102(4.0%)
19439(71.2%)
2646(9.7%)
4049(14.8%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

As we can see 43.2% of race data regarding the perpetrator is UNKNOWN, while only 0.3% of the data are UNKNOWN for the victim.

3.8 Other data

According to NYPD Shooting Incident Level Data Footnotes (pdf file in data folder): “A shooting incident can have multiple victims involved and as a result duplicate INCIDENT_KEY’s are produced. Each INCIDENT_KEY represents a victim but similar duplicate keys are counted as one incident.” Thus variable INCIDENT_KEY gives information about how many victims the specific shooting incident had. Let’s create a variable which contains this information.

count_key <- plyr::count(shootings, "INCIDENT_KEY")
count_key$other_victims <- count_key$freq - 1
count_key$freq <- NULL

shootings<-merge(shootings, count_key, by = "INCIDENT_KEY")

According to NYPD Shooting Incident Level Data Footnotes (pdf file in data folder) variable JURISDICTION_CODE identifies the Jurisdiction where the shooting incident occurred.

Jurisdiction codes are: 0 (Patrol), 1 (Transit) and 2 (Housing). Let’s transform numbers in characters.

levels(shootings$JURISDICTION_CODE) <-c("UNKNOWN", "PATROL", "TRANSIT", "HOUSING")
shootings$jurisdiction <- shootings$JURISDICTION_CODE
print(dfSummary(shootings$jurisdiction), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 jurisdiction [factor]
1. UNKNOWN
2. PATROL
3. TRANSIT
4. HOUSING
2(0.0%)
22809(83.5%)
74(0.3%)
4427(16.2%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Since the transit level is very low in frequency I aggregate it with UNKNOWN level.

levels(shootings$jurisdiction) <-c("UNKNOWN", "PATROL", "UNKNOWN", "HOUSING")
print(dfSummary(shootings$jurisdiction), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 jurisdiction [factor]
1. UNKNOWN
2. PATROL
3. HOUSING
76(0.3%)
22809(83.5%)
4427(16.2%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

3.9 Reordering columns

shootings_final <- shootings[, c( 
                            "day_period",#Time data
                            
                            "day_year",#Date data
                            "year",
                            "week_day",
                            "COVID_lockdown",
                            "COVID_pandemic",
                            
                            "working_hour", #Date and time data
                            
                            "Latitude", #Geographical data
                            "Longitude",
                            
                            "location_desc", #Location description data
                            
                            "city_location", #City location data
                            
                            "perp_age", #Age data
                            "vic_age",
                            
                            "perp_sex", #Sex data
                            "vic_sex",
                            
                            "perp_race", #Race data
                            "vic_race",
                            
                            "other_victims", #Other data
                            "jurisdiction",
                            
                            "murder", #response
                            "murder_prob" 
                            )]
shootings <- NULL

4 Handling Unknown data

Finally we need to decide what to do with the UNKNOWN levels.

print(dfSummary(shootings_final), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 day_period [factor]
1. EarlyMorning
2. Morning
3. EarlyAfternoon
4. Afternoon
5. Evening
6. Night
1539(5.6%)
1383(5.1%)
2287(8.4%)
4828(17.7%)
5818(21.3%)
11457(41.9%)
27312 (100.0%) 0 (0.0%)
2 day_year [numeric]
Mean (sd) : 192.5 (95.8)
min ≤ med ≤ max:
1 ≤ 195 ≤ 366
IQR (CV) : 145 (0.5)
366 distinct values 27312 (100.0%) 0 (0.0%)
3 year [numeric]
Mean (sd) : 2013.5 (5.1)
min ≤ med ≤ max:
2006 ≤ 2013 ≤ 2022
IQR (CV) : 9 (0)
17 distinct values 27312 (100.0%) 0 (0.0%)
4 week_day [factor]
1. Monday
2. Tuesday
3. Wednesday
4. Thursday
5. Friday
6. Saturday
7. Sunday
3883(14.2%)
3163(11.6%)
3000(11.0%)
3034(11.1%)
3585(13.1%)
5195(19.0%)
5452(20.0%)
27312 (100.0%) 0 (0.0%)
5 COVID_lockdown [factor]
1. FALSE
2. TRUE
26004(95.2%)
1308(4.8%)
27312 (100.0%) 0 (0.0%)
6 COVID_pandemic [factor]
1. FALSE
2. TRUE
21810(79.9%)
5502(20.1%)
27312 (100.0%) 0 (0.0%)
7 working_hour [factor]
1. FALSE
2. TRUE
23134(84.7%)
4178(15.3%)
27312 (100.0%) 0 (0.0%)
8 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
12618 distinct values 27312 (100.0%) 0 (0.0%)
9 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
12603 distinct values 27312 (100.0%) 0 (0.0%)
10 location_desc [factor]
1. UNKNOWN
2. COMMERCIAL BLDG
3. GROCERY/BODEGA
4. GYM/FITNESS FACILITY
5. JEWELRY STORE
6. LOAN COMPANY
7. VIDEO STORE
8. OTHER
15070(55.2%)
718(2.6%)
3717(13.6%)
4834(17.7%)
951(3.5%)
628(2.3%)
293(1.1%)
1101(4.0%)
27312 (100.0%) 0 (0.0%)
11 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
[ 2 others ]
583(2.1%)
2989(10.9%)
5691(20.8%)
2246(8.2%)
820(3.0%)
900(3.3%)
2374(8.7%)
4719(17.3%)
4973(18.2%)
1241(4.5%)
776(2.8%)
27312 (100.0%) 0 (0.0%)
12 perp_age [factor]
1. UNKNOWN
2. <18
3. 18-24
4. 25-44
5. 45+
13135(48.1%)
1591(5.8%)
6222(22.8%)
5687(20.8%)
677(2.5%)
27312 (100.0%) 0 (0.0%)
13 vic_age [factor]
1. UNKNOWN
2. <18
3. 18-24
4. 25-44
5. 45+
62(0.2%)
2839(10.4%)
10086(36.9%)
12281(45.0%)
2044(7.5%)
27312 (100.0%) 0 (0.0%)
14 perp_sex [factor]
1. U
2. F
3. M
11449(41.9%)
424(1.6%)
15439(56.5%)
27312 (100.0%) 0 (0.0%)
15 vic_sex [factor]
1. U
2. F
3. M
11(0.0%)
2615(9.6%)
24686(90.4%)
27312 (100.0%) 0 (0.0%)
16 perp_race [factor]
1. UNKNOWN
2. ASIAN/WHITE
3. BLACK
4. BLACK HISPANIC
5. WHITE HISPANIC
11788(43.2%)
437(1.6%)
11432(41.9%)
1314(4.8%)
2341(8.6%)
27312 (100.0%) 0 (0.0%)
17 vic_race [factor]
1. UNKNOWN
2. ASIAN/WHITE
3. BLACK
4. BLACK HISPANIC
5. WHITE HISPANIC
76(0.3%)
1102(4.0%)
19439(71.2%)
2646(9.7%)
4049(14.8%)
27312 (100.0%) 0 (0.0%)
18 other_victims [numeric]
Mean (sd) : 0.8 (1.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 17
IQR (CV) : 1 (2.1)
12 distinct values 27312 (100.0%) 0 (0.0%)
19 jurisdiction [factor]
1. UNKNOWN
2. PATROL
3. HOUSING
76(0.3%)
22809(83.5%)
4427(16.2%)
27312 (100.0%) 0 (0.0%)
20 murder [factor]
1. FALSE
2. TRUE
22046(80.7%)
5266(19.3%)
27312 (100.0%) 0 (0.0%)
21 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:22046(80.7%)
1:5266(19.3%)
27312 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

I decided to simply omit the rows that present UNKNOWN data for at least one of the following variables: vic_age, vic_sex, vic_race and jurisdiction since they all present a very low frequency of UNKNWON data.

removeUnknown <- function(df, names){
  for (name in names){
    levels(df[[name]])[levels(df[[name]])=="UNKNOWN"|levels(df[[name]])=="U"]<- NA 
  }
  
  df <- na.omit(df)
  droplevels(df)
}

shootings_final <- removeUnknown(shootings_final, c("vic_age", "vic_sex", "vic_race", "jurisdiction"))

print(dfSummary(shootings_final), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 day_period [factor]
1. EarlyMorning
2. Morning
3. EarlyAfternoon
4. Afternoon
5. Evening
6. Night
1512(5.6%)
1370(5.1%)
2260(8.3%)
4783(17.6%)
5797(21.4%)
11389(42.0%)
27111 (100.0%) 0 (0.0%)
2 day_year [numeric]
Mean (sd) : 192.4 (95.7)
min ≤ med ≤ max:
1 ≤ 195 ≤ 366
IQR (CV) : 145 (0.5)
366 distinct values 27111 (100.0%) 0 (0.0%)
3 year [numeric]
Mean (sd) : 2013.5 (5.1)
min ≤ med ≤ max:
2006 ≤ 2013 ≤ 2022
IQR (CV) : 9 (0)
17 distinct values 27111 (100.0%) 0 (0.0%)
4 week_day [factor]
1. Monday
2. Tuesday
3. Wednesday
4. Thursday
5. Friday
6. Saturday
7. Sunday
3842(14.2%)
3136(11.6%)
2977(11.0%)
3018(11.1%)
3560(13.1%)
5158(19.0%)
5420(20.0%)
27111 (100.0%) 0 (0.0%)
5 COVID_lockdown [factor]
1. FALSE
2. TRUE
25806(95.2%)
1305(4.8%)
27111 (100.0%) 0 (0.0%)
6 COVID_pandemic [factor]
1. FALSE
2. TRUE
21638(79.8%)
5473(20.2%)
27111 (100.0%) 0 (0.0%)
7 working_hour [factor]
1. FALSE
2. TRUE
22974(84.7%)
4137(15.3%)
27111 (100.0%) 0 (0.0%)
8 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
12547 distinct values 27111 (100.0%) 0 (0.0%)
9 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
12532 distinct values 27111 (100.0%) 0 (0.0%)
10 location_desc [factor]
1. UNKNOWN
2. COMMERCIAL BLDG
3. GROCERY/BODEGA
4. GYM/FITNESS FACILITY
5. JEWELRY STORE
6. LOAN COMPANY
7. VIDEO STORE
8. OTHER
14926(55.1%)
712(2.6%)
3703(13.7%)
4818(17.8%)
944(3.5%)
625(2.3%)
293(1.1%)
1090(4.0%)
27111 (100.0%) 0 (0.0%)
11 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
[ 2 others ]
567(2.1%)
2976(11.0%)
5655(20.9%)
2228(8.2%)
818(3.0%)
884(3.3%)
2359(8.7%)
4689(17.3%)
4937(18.2%)
1225(4.5%)
773(2.9%)
27111 (100.0%) 0 (0.0%)
12 perp_age [factor]
1. UNKNOWN
2. <18
3. 18-24
4. 25-44
5. 45+
13079(48.2%)
1577(5.8%)
6175(22.8%)
5622(20.7%)
658(2.4%)
27111 (100.0%) 0 (0.0%)
13 vic_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
2827(10.4%)
10025(37.0%)
12233(45.1%)
2026(7.5%)
27111 (100.0%) 0 (0.0%)
14 perp_sex [factor]
1. U
2. F
3. M
11400(42.0%)
420(1.5%)
15291(56.4%)
27111 (100.0%) 0 (0.0%)
15 vic_sex [factor]
1. F
2. M
2602(9.6%)
24509(90.4%)
27111 (100.0%) 0 (0.0%)
16 perp_race [factor]
1. UNKNOWN
2. ASIAN/WHITE
3. BLACK
4. BLACK HISPANIC
5. WHITE HISPANIC
11736(43.3%)
431(1.6%)
11334(41.8%)
1306(4.8%)
2304(8.5%)
27111 (100.0%) 0 (0.0%)
17 vic_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
1074(4.0%)
19380(71.5%)
2639(9.7%)
4018(14.8%)
27111 (100.0%) 0 (0.0%)
18 other_victims [numeric]
Mean (sd) : 0.8 (1.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 17
IQR (CV) : 1 (2.1)
12 distinct values 27111 (100.0%) 0 (0.0%)
19 jurisdiction [factor]
1. PATROL
2. HOUSING
22699(83.7%)
4412(16.3%)
27111 (100.0%) 0 (0.0%)
20 murder [factor]
1. FALSE
2. TRUE
21882(80.7%)
5229(19.3%)
27111 (100.0%) 0 (0.0%)
21 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:21882(80.7%)
1:5229(19.3%)
27111 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

4.1 Location description data

Since variable location_desc has a lot of UNKNOWN value, let’s investigate if those are “missing at random”. Let’s create a binary indicator which tells us when location description data are known or not:

shootings_final$known_location_desc <- as.factor(ifelse(shootings_final$location_desc=="UNKNOWN", "FALSE", "TRUE"))
ggplot(shootings_final %>% group_by(murder, known_location_desc)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=known_location_desc, y=ratio, x=murder)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.1L)), position=position_fill(vjust=0.5)) +
        labs(title="Known location description Ratio vs Murder", x="Murder", y="Known location description ratio", fill="Known location description")

As we can see, when the shooting incident is also a murder the location description is known more often. Maybe the data are more carefully entered when a shooting incident is also a murder.

ggplot(shootings_final %>% group_by(year, known_location_desc)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=known_location_desc, y=ratio, x=year)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.1L)), position=position_fill(vjust=0.5)) +
        scale_x_continuous(breaks=seq(min(shootings_final$year), max(shootings_final$year), 1)) +
        labs(title="Known location description Ratio vs Year", x="Year", y="Known location description Ratio",fill="Known location description")

As we can see, location description data are known almost always from year 2022. From 2006 to 2012 the situation was quite uniform, while from 2018 to 2021 the majority of location description are unknown. It seams like from 2022 there was a change in the data entry procedure for this variable, which minimized the UNKNOWN rows. This is a clear indication that this variable is not missing at random.

ggplot(shootings_final %>% group_by(jurisdiction, known_location_desc)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=known_location_desc, y=ratio, x=jurisdiction)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.1L)), position=position_fill(vjust=0.5)) +
        labs(title="Known location description Ratio vs Jurisdiction", x="Jurisdiction", y="Known location description Ratio",fill="Known location description")

As we can see, location description data are known almost always when jurisdiction is “hosing”. This is a another clear indication that this variable is not missing at random.

Finally we check the distribution of jurisdiction in year 2022.

print(dfSummary(shootings_final[shootings_final$year==2022, "jurisdiction"]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 jurisdiction [factor]
1. PATROL
2. HOUSING
1486(87.5%)
212(12.5%)
1698 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

print(dfSummary(shootings_final[, "jurisdiction"]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 jurisdiction [factor]
1. PATROL
2. HOUSING
22699(83.7%)
4412(16.3%)
27111 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

Which is more or less the same as the other year.

Another test that could be done is fit a logistic regression model to predict whenever the location description in KNOWN or not.

shootings_final$known_location_desc <- as.factor(ifelse(as.logical(shootings_final$known_location_desc),1,0))

I excluded from the fit location_desc which is clearly correlated, the perpetrator data which contains missing data and murder_prob which is nothing else but a copy of variable murder.

known_location_fit <- glm( known_location_desc ~ . -(location_desc) -(perp_race) -(perp_sex) -(perp_age) -(murder_prob), data = shootings_final, family = binomial)

summary(known_location_fit)
## 
## Call:
## glm(formula = known_location_desc ~ . - (location_desc) - (perp_race) - 
##     (perp_sex) - (perp_age) - (murder_prob), family = binomial, 
##     data = shootings_final)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -3.930e+01  6.513e+01  -0.603 0.546233    
## day_periodMorning             4.280e-02  9.567e-02   0.447 0.654614    
## day_periodEarlyAfternoon     -4.882e-02  8.649e-02  -0.564 0.572452    
## day_periodAfternoon          -2.583e-01  7.031e-02  -3.673 0.000240 ***
## day_periodEvening            -3.674e-01  6.716e-02  -5.470 4.49e-08 ***
## day_periodNight              -2.451e-01  6.228e-02  -3.935 8.31e-05 ***
## day_year                     -1.854e-03  1.509e-04 -12.288  < 2e-16 ***
## year                         -9.712e-02  4.390e-03 -22.123  < 2e-16 ***
## week_dayTuesday               4.142e-02  5.754e-02   0.720 0.471684    
## week_dayWednesday             1.932e-02  5.837e-02   0.331 0.740592    
## week_dayThursday             -7.591e-02  5.817e-02  -1.305 0.191893    
## week_dayFriday                4.693e-02  5.550e-02   0.846 0.397770    
## week_daySaturday             -8.037e-03  5.249e-02  -0.153 0.878319    
## week_daySunday               -4.591e-02  5.217e-02  -0.880 0.378786    
## COVID_lockdownTRUE           -1.699e+00  8.600e-02 -19.755  < 2e-16 ***
## COVID_pandemicTRUE            1.817e+00  5.899e-02  30.805  < 2e-16 ***
## working_hourTRUE              7.413e-02  6.079e-02   1.219 0.222674    
## Latitude                     -1.158e-01  6.699e-01  -0.173 0.862765    
## Longitude                    -3.238e+00  7.245e-01  -4.469 7.86e-06 ***
## city_locationN_Manhattan     -1.489e-01  1.271e-01  -1.171 0.241431    
## city_locationW_Bronx         -1.582e-02  1.368e-01  -0.116 0.907930    
## city_locationE_Bronx          2.378e-01  1.622e-01   1.466 0.142619    
## city_locationN_E_Queens       3.506e-01  1.935e-01   1.812 0.070014 .  
## city_locationN_W_Queens       2.457e-01  1.472e-01   1.668 0.095237 .  
## city_locationS_Queens         4.103e-01  2.022e-01   2.029 0.042420 *  
## city_locationN_Brooklyn      -1.082e-01  1.222e-01  -0.886 0.375885    
## city_locationS_E_Brooklyn     4.265e-02  1.378e-01   0.310 0.756881    
## city_locationS_W_Brooklyn    -2.856e-03  1.468e-01  -0.019 0.984481    
## city_locationW_Staten_Island -5.588e-01  2.577e-01  -2.168 0.030146 *  
## city_locationE_Staten_Island -3.825e-01  1.701e-01  -2.249 0.024540 *  
## vic_age18-24                  1.923e-01  5.322e-02   3.614 0.000302 ***
## vic_age25-44                  3.366e-01  5.258e-02   6.402 1.54e-10 ***
## vic_age45+                    6.160e-01  6.992e-02   8.810  < 2e-16 ***
## vic_sexM                     -3.360e-01  4.882e-02  -6.883 5.87e-12 ***
## vic_raceBLACK                -3.267e-01  7.071e-02  -4.619 3.85e-06 ***
## vic_raceBLACK HISPANIC       -2.786e-01  8.280e-02  -3.365 0.000766 ***
## vic_raceWHITE HISPANIC       -2.604e-01  7.797e-02  -3.340 0.000839 ***
## other_victims                 8.339e-02  8.353e-03   9.983  < 2e-16 ***
## jurisdictionHOUSING           5.013e+00  1.180e-01  42.471  < 2e-16 ***
## murderTRUE                    4.353e-01  3.581e-02  12.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37306  on 27110  degrees of freedom
## Residual deviance: 28082  on 27071  degrees of freedom
## AIC: 28162
## 
## Number of Fisher Scoring iterations: 6

As we can see a lot of predictors are significant.

We conclude by saying that this variable is clearly not missing at random. I decided to completely drop the location_desc variables.

shootings_final$location_desc <- NULL
shootings_final$known_location_desc <- NULL

4.2 Perpetrator data

As regarding to the perpetrator variables, I assumed that missing data in all the perpetrator field implies that the perpetrator was not known at the time of the incident. Thus I divided shootings_final in two datasets:

  • shootings_known: containing shootings with known perpetrator.

  • shootings_unknown: containing shootings with unknown perpetrator.

These two datasets will be analyzed separately.

4.2.1 Unknown perpetrator

shootings_unknown <- shootings_final[shootings_final$perp_sex=="U" & shootings_final$perp_age=="UNKNOWN" & shootings_final$perp_race=="UNKNOWN",]
shootings_unknown$perp_age <- NULL
shootings_unknown$perp_race <- NULL
shootings_unknown$perp_sex <- NULL

dim(shootings_unknown)
## [1] 11367    17
print(dfSummary(shootings_unknown), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 day_period [factor]
1. EarlyMorning
2. Morning
3. EarlyAfternoon
4. Afternoon
5. Evening
6. Night
611(5.4%)
430(3.8%)
691(6.1%)
1709(15.0%)
2516(22.1%)
5410(47.6%)
11367 (100.0%) 0 (0.0%)
2 day_year [numeric]
Mean (sd) : 199.6 (93.6)
min ≤ med ≤ max:
1 ≤ 203 ≤ 366
IQR (CV) : 133 (0.5)
366 distinct values 11367 (100.0%) 0 (0.0%)
3 year [numeric]
Mean (sd) : 2014.3 (4.8)
min ≤ med ≤ max:
2006 ≤ 2014 ≤ 2022
IQR (CV) : 9 (0)
17 distinct values 11367 (100.0%) 0 (0.0%)
4 week_day [factor]
1. Monday
2. Tuesday
3. Wednesday
4. Thursday
5. Friday
6. Saturday
7. Sunday
1691(14.9%)
1217(10.7%)
1138(10.0%)
1211(10.7%)
1448(12.7%)
2209(19.4%)
2453(21.6%)
11367 (100.0%) 0 (0.0%)
5 COVID_lockdown [factor]
1. FALSE
2. TRUE
10608(93.3%)
759(6.7%)
11367 (100.0%) 0 (0.0%)
6 COVID_pandemic [factor]
1. FALSE
2. TRUE
8671(76.3%)
2696(23.7%)
11367 (100.0%) 0 (0.0%)
7 working_hour [factor]
1. FALSE
2. TRUE
10083(88.7%)
1284(11.3%)
11367 (100.0%) 0 (0.0%)
8 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
6993 distinct values 11367 (100.0%) 0 (0.0%)
9 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
6985 distinct values 11367 (100.0%) 0 (0.0%)
10 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
[ 2 others ]
123(1.1%)
1153(10.1%)
2181(19.2%)
885(7.8%)
355(3.1%)
338(3.0%)
992(8.7%)
2241(19.7%)
2471(21.7%)
446(3.9%)
182(1.6%)
11367 (100.0%) 0 (0.0%)
11 vic_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
1073(9.4%)
4357(38.3%)
5200(45.7%)
737(6.5%)
11367 (100.0%) 0 (0.0%)
12 vic_sex [factor]
1. F
2. M
872(7.7%)
10495(92.3%)
11367 (100.0%) 0 (0.0%)
13 vic_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
304(2.7%)
8723(76.7%)
1013(8.9%)
1327(11.7%)
11367 (100.0%) 0 (0.0%)
14 other_victims [numeric]
Mean (sd) : 0.4 (1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 9
IQR (CV) : 0 (2.4)
0:8930(78.6%)
1:1428(12.6%)
2:194(1.7%)
3:619(5.4%)
4:5(0.0%)
5:150(1.3%)
6:7(0.1%)
7:24(0.2%)
9:10(0.1%)
11367 (100.0%) 0 (0.0%)
15 jurisdiction [factor]
1. PATROL
2. HOUSING
9383(82.5%)
1984(17.5%)
11367 (100.0%) 0 (0.0%)
16 murder [factor]
1. FALSE
2. TRUE
9481(83.4%)
1886(16.6%)
11367 (100.0%) 0 (0.0%)
17 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:9481(83.4%)
1:1886(16.6%)
11367 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

4.2.2 Known perpetrator

shootings_known <- shootings_final[shootings_final$perp_sex!="U" & shootings_final$perp_age!="UNKNOWN" & shootings_final$perp_race!="UNKNOWN",]

shootings_known <- droplevels(shootings_known)
dim(shootings_known)
## [1] 13873    20
print(dfSummary(shootings_known), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 day_period [factor]
1. EarlyMorning
2. Morning
3. EarlyAfternoon
4. Afternoon
5. Evening
6. Night
804(5.8%)
860(6.2%)
1429(10.3%)
2799(20.2%)
2872(20.7%)
5109(36.8%)
13873 (100.0%) 0 (0.0%)
2 day_year [numeric]
Mean (sd) : 187 (96.8)
min ≤ med ≤ max:
1 ≤ 188 ≤ 366
IQR (CV) : 150 (0.5)
366 distinct values 13873 (100.0%) 0 (0.0%)
3 year [numeric]
Mean (sd) : 2013.5 (5.2)
min ≤ med ≤ max:
2006 ≤ 2013 ≤ 2022
IQR (CV) : 9 (0)
17 distinct values 13873 (100.0%) 0 (0.0%)
4 week_day [factor]
1. Monday
2. Tuesday
3. Wednesday
4. Thursday
5. Friday
6. Saturday
7. Sunday
1902(13.7%)
1685(12.1%)
1635(11.8%)
1622(11.7%)
1889(13.6%)
2590(18.7%)
2550(18.4%)
13873 (100.0%) 0 (0.0%)
5 COVID_lockdown [factor]
1. FALSE
2. TRUE
13327(96.1%)
546(3.9%)
13873 (100.0%) 0 (0.0%)
6 COVID_pandemic [factor]
1. FALSE
2. TRUE
11096(80.0%)
2777(20.0%)
13873 (100.0%) 0 (0.0%)
7 working_hour [factor]
1. FALSE
2. TRUE
11262(81.2%)
2611(18.8%)
13873 (100.0%) 0 (0.0%)
8 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
7051 distinct values 13873 (100.0%) 0 (0.0%)
9 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
7047 distinct values 13873 (100.0%) 0 (0.0%)
10 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
[ 2 others ]
405(2.9%)
1626(11.7%)
3122(22.5%)
1182(8.5%)
392(2.8%)
493(3.6%)
1172(8.4%)
2105(15.2%)
2097(15.1%)
709(5.1%)
570(4.1%)
13873 (100.0%) 0 (0.0%)
11 perp_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
1560(11.2%)
6101(44.0%)
5559(40.1%)
653(4.7%)
13873 (100.0%) 0 (0.0%)
12 vic_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
1527(11.0%)
4881(35.2%)
6294(45.4%)
1171(8.4%)
13873 (100.0%) 0 (0.0%)
13 perp_sex [factor]
1. F
2. M
401(2.9%)
13472(97.1%)
13873 (100.0%) 0 (0.0%)
14 vic_sex [factor]
1. F
2. M
1597(11.5%)
12276(88.5%)
13873 (100.0%) 0 (0.0%)
15 perp_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
409(2.9%)
10078(72.6%)
1211(8.7%)
2175(15.7%)
13873 (100.0%) 0 (0.0%)
16 vic_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
686(4.9%)
9363(67.5%)
1448(10.4%)
2376(17.1%)
13873 (100.0%) 0 (0.0%)
17 other_victims [numeric]
Mean (sd) : 1.2 (2.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 17
IQR (CV) : 1 (1.8)
11 distinct values 13873 (100.0%) 0 (0.0%)
18 jurisdiction [factor]
1. PATROL
2. HOUSING
11671(84.1%)
2202(15.9%)
13873 (100.0%) 0 (0.0%)
19 murder [factor]
1. FALSE
2. TRUE
10561(76.1%)
3312(23.9%)
13873 (100.0%) 0 (0.0%)
20 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:10561(76.1%)
1:3312(23.9%)
13873 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

5 Data Visualization

5.1 Day period

group_by_IN <- function(df, variable){
  df %>% group_by(across(variable), murder)  %>% summarise(incident_number = n())
}

y_IN <- function(variable){
  c(0, max(group_by_IN(shootings_known, variable)$incident_number, group_by_IN(shootings_unknown, variable)$incident_number))
}
known_day_period_plot <- ggplot(shootings_known %>% group_by(day_period, murder)  %>% summarise(incident_number = n()), 
      aes(x=day_period, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.2, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Day period", y = "Incidents Number", fill = "Murder") +
      scale_y_continuous(limits = y_IN("day_period"))

unknown_day_period_plot <- ggplot(shootings_unknown %>% group_by(day_period, murder)  %>% summarise(incident_number = n()), 
      aes(x=day_period, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.2, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Day period", y = "Incidents Number", fill = "Murder") +
      scale_y_continuous(limits = y_IN("day_period"))

known_day_period_plot + unknown_day_period_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Day period')

Not surprisingly there are more shootings during the day with known perpetrator compared to unknown perpetrator; while during the night we have an opposite situation.

As we can see there is an increasing trend in the number of shooting during the day for both known and unknown perpetrator. For the unknown perpetrator there is a more significantly increase during the night. Let’s see if the murders increases as the same speed as non-murders:

known_day_period_plot_ratio <- ggplot(shootings_known %>% group_by(day_period, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=day_period)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator",
        x="Day period",
        y=" Murder Ratio",
        fill="Murder")

unknown_day_period_plot_ratio <- ggplot(shootings_unknown %>% group_by(day_period, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=day_period)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator",
        x="Day period",
        y="Murder Ratio",
        fill="Murder")

known_day_period_plot_ratio + unknown_day_period_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Day period')

As we can see the ratio of murders is decreasing, it seams that it is less likely to die during the night, despite the high number of shootings and more likely to die in the morning where the number of shootings is significantly lower. Furthermore for the known perpetrator there is a spike in murders in the evening.

5.2 Year

known_year_plot <- ggplot(data=shootings_known %>% group_by(year, murder)  %>% summarise(incident_number = n()), 
      aes(x=year, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      scale_x_continuous(breaks=seq(min(shootings_known$year), max(shootings_known$year), 2))+
      geom_text(aes(label=incident_number), vjust=3, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Year", y = "Incidents Number", fill = "Murder") +
      scale_y_continuous(limits = y_IN("year"))

unknown_year_plot <- ggplot(data=shootings_unknown %>% group_by(year, murder)  %>% summarise(incident_number = n()), 
      aes(x=year, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      scale_x_continuous(breaks=seq(min(shootings_unknown$year), max(shootings_unknown$year), 2))+
      geom_text(aes(label=incident_number), vjust=1, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Year", y = "Incidents Number", fill = "Murder") +
      scale_y_continuous(limits = y_IN("year"))

known_year_plot + unknown_year_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Year')

As we can see from 2006 to 2019 there is a decreasing thread with a strange spike in 2008 in the number of shootings with known perpetrator. In 2020, 2021 and 2022 the number of shooting with known perpetrator was the same as between 2006 and 2009. This corresponds to the COVID period.

As we can see from 2006 to 2011 there is a increasing thread and from 2012 to 2019 a decreasing trend in the number of shootings with unknown perpetrator. In 2020 and 2021 the number of shooting with unknown perpetrator was the highest since at least 2006. This corresponds to the COVID period. In 2022 the situation seams the same as the pre COVID situation.

Let’s investigate the murder ratio:

known_year_plot_ratio <- ggplot(shootings_known %>% group_by(year, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=year)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.1L)), position=position_fill(vjust=0.5)) +
        scale_x_continuous(breaks=seq(min(shootings_known$year), max(shootings_known$year), 2)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Year", y="Murder Ratio",fill="Murder")

unknown_year_plot_ratio <- ggplot(shootings_unknown %>% group_by(year, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=year)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.1L)), position=position_fill(vjust=0.5)) +
        scale_x_continuous(breaks=seq(min(shootings_unknown$year), max(shootings_unknown$year), 2)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Year", y="Murder Ratio",fill="Murder")

known_year_plot_ratio + unknown_year_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Rario vs Year')

As we can see the murder ratio fluctuates for both shooting with known perpetrator with sporadic spikes. There is a decreasing trend in murders between 2006 and 2011 for unknown perpetrator shootings.

5.3 Day of the year

known_day_year_plot <- ggplot(data=shootings_known %>% group_by(day_year, murder)  %>% summarise(incident_number = n()), 
      aes(x=day_year, y=incident_number, color=murder, fill = murder)) + geom_point() + 
      geom_line() +   
      geom_smooth(method="gam") +
      scale_x_continuous(breaks=seq(min(shootings_known$day_year), max(shootings_known$day_year), 20)) +
      scale_color_brewer(palette="Paired") +
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Day of the year", y = "Incidents Number", color = "Murder", fill="Murder")+
      scale_y_continuous(limits = y_IN("day_year"))

unknown_day_year_plot <- ggplot(data=shootings_unknown %>% group_by(day_year, murder)  %>% summarise(incident_number = n()), 
      aes(x=day_year, y=incident_number, color=murder, fill = murder)) + geom_point() + 
      geom_line() +   
      geom_smooth(method="gam") +
      scale_x_continuous(breaks=seq(min(shootings_unknown$day_year), max(shootings_unknown$day_year), 20)) +
      scale_color_brewer(palette="Paired") +
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Day of the year", y = "Incidents Number", color = "Murder", fill="Murder")+
      scale_y_continuous(limits = y_IN("day_year"))

known_day_year_plot+ unknown_day_year_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Day of the year')

As we can see, the number of shootings has a large fluctuation during the year. Furthermore there is an significant increasing trend in the non-murders during the summer and a sightly significant increasing trend during the summer for the murders. This in true for both known and unknown perpetrator. Let’s see the murder ratio:

to_plot_known <- shootings_known %>% group_by(day_year, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number))
to_plot_unknown <- shootings_unknown %>% group_by(day_year, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number))

known_day_year_plot_ratio <- ggplot(to_plot_known[to_plot_known$murder=="TRUE",],
        aes(y=ratio, x=day_year, color=ratio)) + geom_point() + geom_line() + geom_smooth(method="gam") +
        scale_x_continuous(breaks=seq(min(to_plot_known$day_year), max(to_plot_known$day_year), 20)) +
        labs(title="Known perpetrator", x="Day of the year", y="Murder Ratio",color="Murder Ratio") +
        scale_y_continuous(limits = c(0, 0.7))

unknown_day_year_plot_ratio <- ggplot(to_plot_unknown[to_plot_unknown$murder=="TRUE",],
        aes(y=ratio, x=day_year, color=ratio)) + geom_point() + geom_line() + geom_smooth(method="gam") +
        scale_x_continuous(breaks=seq(min(to_plot_unknown$day_year), max(to_plot_unknown$day_year), 20)) +
        labs(title="Unknown perpetrator", x="Day of the year", y="Murder Ratio",color="Murder Ratio") +
        scale_y_continuous(limits = c(0, 0.7))

known_day_year_plot_ratio + unknown_day_year_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Day of the year')

As we can see the number ratio decreases during summer as suggested from the previews plot: the increasing in non-murders is much more high compared to the increase in murders. Furthermore it seams like it is slight more likely to die during the last part of the year. This is true for both shootings with known and unknown perpetrator but for the latter is more significant.

Could be useful to analyse an interaction term between Year and Day of the year.

5.4 Week day

known_week_day_plot <- ggplot(data=shootings_known %>% group_by(week_day, murder)  %>% summarise(incident_number = n()), 
      aes(x=week_day, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=3, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Week Day", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("week_day"))

unknown_week_day_plot <- ggplot(data=shootings_unknown %>% group_by(week_day, murder)  %>% summarise(incident_number = n()), 
      aes(x=week_day, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=3, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Week Day", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("week_day"))

known_week_day_plot + unknown_week_day_plot + plot_layout(guides = "collect")+ plot_annotation(title = 'Incident Number vs Week day')

As we can see the number of shooting is much higher during the weekends and on Monday. This is true for both shootings with known and unknown perpetrator but in the latter is more significant. Let’s see the murder ratio:

known_week_day_plot_ratio <- ggplot(shootings_known %>% group_by(week_day, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=week_day)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Week Day", y="Murder Ratio",fill="Murder")

unknown_week_day_plot_ratio <- ggplot(shootings_unknown %>% group_by(week_day, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=week_day)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Week Day", y="Murder Ratio",fill="Murder")

known_week_day_plot_ratio + unknown_week_day_plot_ratio + plot_layout(guides = "collect")+ plot_annotation(title = 'Murder Ratio vs Week day')

As we can see, the murder ratio for shootings with known perpetrator stays quite stable during the week. This is true also for shootings with unknown perpetrator but we have a high spike on Wednesday and a lower spike on Saturday.

5.5 COVID lockdown

known_COVID_lockdown_plot <- ggplot(data=shootings_known %>% group_by(COVID_lockdown, murder)  %>% summarise(incident_number = n()), 
      aes(x=COVID_lockdown, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "COVID lockdown period", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("COVID_lockdown"))

unknown_COVID_lockdown_plot <- ggplot(data=shootings_unknown %>% group_by(COVID_lockdown, murder)  %>% summarise(incident_number = n()), 
      aes(x=COVID_lockdown, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "COVID lockdown period", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("COVID_lockdown"))

known_COVID_lockdown_plot + unknown_COVID_lockdown_plot + plot_layout(guides = "collect")+ plot_annotation(title = 'Incident Number vs COVID lockdown period')

This plot is not quite informative as the number of observation is high during the non-COVID lockdown period only because it covers a much long time period. Let’s see if there are any change in the murder ratio:

known_COVID_lockdown_plot_ratio <- ggplot(shootings_known %>% group_by(COVID_lockdown, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=COVID_lockdown)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Covid lockdown period", y="Murder Ratio",fill="Murder")

unknown_COVID_lockdown_plot_ratio <- ggplot(shootings_unknown %>% group_by(COVID_lockdown, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=COVID_lockdown)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Covid lockdown period", y="Murder Ratio",fill="Murder")

known_COVID_lockdown_plot_ratio + unknown_COVID_lockdown_plot_ratio + plot_layout(guides = "collect")+ plot_annotation(title = 'Murer ratio vs COVID lockdown period')

As we can see the COVID lockdown seams to have an opposite effect for shootings with known perpetrator and unknown perpetrator: for shooting with known perpetrator the murder ratio is slightly higher during COVID lockdown period, while for shootings with unknown perpetrator it is higher in non-COVID lockdown period.

5.6 COVID pandemic

known_COVID_pandemic_plot <- ggplot(data=shootings_known %>% group_by(COVID_pandemic, murder)  %>% summarise(incident_number = n()), 
      aes(x=COVID_pandemic, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "COVID pandemic period", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("COVID_pandemic"))

unknown_COVID_pandemic_plot <- ggplot(data=shootings_unknown %>% group_by(COVID_pandemic, murder)  %>% summarise(incident_number = n()), 
      aes(x=COVID_pandemic, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "COVID pandemic period", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("COVID_pandemic"))

known_COVID_pandemic_plot + unknown_COVID_pandemic_plot + plot_layout(guides = "collect")+ plot_annotation(title = 'Incident Number vs COVID pandemic period')

As before this plot is not quite informative as the number of observation is high during the non-COVID pandemic period only because it covers a much long time period. Let’s see if there are any change in the murder ratio:

known_COVID_pandemic_plot_ratio <- ggplot(shootings_known %>% group_by(COVID_pandemic, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=COVID_pandemic)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="COVID pandemic period", y="Murder Ratio",fill="Murder")

unknown_COVID_pandemic_plot_ratio <- ggplot(shootings_unknown %>% group_by(COVID_pandemic, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=COVID_pandemic)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="COVID pandemic period", y="Murder Ratio",fill="Murder")

known_COVID_pandemic_plot_ratio + unknown_COVID_pandemic_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs COVID pandemic period')

As we can see the COVID pandemic seams to have an opposite effect for shootings with known perpetrator and unknown perpetrator: for shooting with known perpetrator the murder ratio is slightly higher during COVID pandemic period, while for shootings with unknown perpetrator it is higher in non-COVID pandemic period.

5.7 Working Hour

known_working_hour_plot <- ggplot(data=shootings_known %>% group_by(working_hour, murder)  %>% summarise(incident_number = n()), 
      aes(x=working_hour, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "working hour", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("working_hour"))

unknown_working_hour_plot <- ggplot(data=shootings_unknown %>% group_by(working_hour, murder)  %>% summarise(incident_number = n()), 
      aes(x=working_hour, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "working hour", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("working_hour"))

known_working_hour_plot+unknown_working_hour_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Working hour')

As we can see, the number of shootings is higher in a working hour, this is simple because there are more working days than non-working days. This is true for both shootings with known and unknown perpetrator but this patter is more significant for unknown perpetrator shootings.

known_working_hour_plot_ratio <- ggplot(shootings_known %>% group_by(working_hour, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=working_hour)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Working hour", y="Murder Ratio",fill="Murder")

unknown_working_hour_plot_ratio <- ggplot(shootings_unknown %>% group_by(working_hour, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=working_hour)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Working hour", y="Murder Ratio",fill="Murder")

known_working_hour_plot_ratio + unknown_working_hour_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder ratio vs Working hour')

As we can see the working hour seams to have an opposite effect for shootings with known perpetrator and unknown perpetrator: for shooting with known perpetrator the murder ratio is slightly higher during a non-working hour, while for shootings with unknown perpetrator it is slightly higher in a working hour.

5.8 Geographical data

register_stadiamaps(key="e6b86eb9-3e44-40fa-81ba-b4911166205f", write = TRUE)

bbox_known <- make_bbox(Longitude, Latitude, data = shootings_known)
bbox_unknown <- make_bbox(Longitude, Latitude, data = shootings_unknown)

known_murder_map <- ggmap(get_stadiamap( bbox = bbox_known, maptype = "stamen_toner_lite", zoom = 10 , crop = FALSE)) + 
  geom_hdr(
    aes(Longitude, Latitude, fill = after_stat(probs)), data = shootings_known,
    alpha = .5
  ) + labs(title = "Known perpetrator", fill = "Shootings distrubution")

unknown_murder_map <- ggmap(get_stadiamap( bbox = bbox_unknown, maptype = "stamen_toner_lite", zoom = 10 , crop = FALSE)) + 
  geom_hdr(
    aes(Longitude, Latitude, fill = after_stat(probs)), data = shootings_unknown,
    alpha = .5
  ) + labs(title = "Unknown perpetrator", fill = "Shootings distrubution")

((known_murder_map + unknown_murder_map) + plot_layout(guides = "collect")+ plot_annotation(title = 'Shootings distrubution on map')) & theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())

As we can see, it is less likely for a shootings to happen in inner regions of the city.

5.9 City Location data

known_city_location_plot <- ggplot(data=shootings_known %>% group_by(city_location, murder)  %>% summarise(incident_number = n()), 
      aes(x=city_location, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "City location", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("city_location"))

unknown_city_location_plot <- ggplot(data=shootings_unknown %>% group_by(city_location, murder)  %>% summarise(incident_number = n()), 
      aes(x=city_location, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "City location", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("city_location"))

(known_city_location_plot / unknown_city_location_plot) + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs City location')

As we can see from the plot the majority of the shootings happen in West Bronx, North Brooklyn and South-East Brooklyn and North Manhattan for both known and unknown perpetrator.

known_city_location_plot_ratio <- ggplot(data=shootings_known %>% group_by(city_location, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=city_location)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="City location", y="Murder Ratio",fill="Murder")

unknown_city_location_plot_ratio <- ggplot(data=shootings_unknown %>% group_by(city_location, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=city_location)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="City location", y="Murder Ratio",fill="Murder")

(known_city_location_plot_ratio / unknown_city_location_plot_ratio) + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs City location')

As always the murder ratio for unknown perpetrator shootings are lower compared to known perpetrator shootings but this trend is less significant for North-East Queens, North Brooklyn, South East Brooklyn and South West Brooklyn.

5.10 Victim age data

known_vic_age_plot <- ggplot(data=shootings_known %>% group_by(vic_age, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_age, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Victim age", y = "Incidents Number", fill = "Murder") +
      scale_y_continuous(limits = y_IN("vic_age"))

unknown_vic_age_plot <- ggplot(data=shootings_unknown %>% group_by(vic_age, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_age, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Victim age", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("vic_age"))

known_vic_age_plot + unknown_vic_age_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Victim age')

As we can see the majority of the victims are young adults. Let’s investigate murder ratio:

known_vic_age_plot_ratio <- ggplot(data=shootings_known %>% group_by(vic_age, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_age)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Victim age", y="Murder Ratio",fill="Murder")

unknown_vic_age_plot_ratio <- ggplot(data=shootings_unknown %>% group_by(vic_age, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_age)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Victim age", y="Murder Ratio",fill="Murder")

known_vic_age_plot_ratio + unknown_vic_age_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Victim age')

Not surprisingly, the older the victim is, the less changes of survival she/he has.

5.11 Victim sex data

known_vic_sex_plot <- ggplot(data=shootings_known %>% group_by(vic_sex, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_sex, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Victim sex", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("vic_sex"))

unknown_vic_sex_plot <- ggplot(data=shootings_unknown %>% group_by(vic_sex, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_sex, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Victim sex", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("vic_sex"))

known_vic_sex_plot + unknown_vic_sex_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Victim Sex')

As we can see the majority of the victims are male for both shootings with known and unknown perpetrator. Let’s investigate the murder ratio:

known_vic_sex_plot_ratio <- ggplot(data=shootings_known %>% group_by(vic_sex, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_sex)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Victim sex", y="Murder Ratio",fill="Murder")

unknown_vic_sex_plot_ratio <- ggplot(data=shootings_unknown %>% group_by(vic_sex, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_sex)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Victim sex", y="Murder Ratio",fill="Murder")

known_vic_sex_plot_ratio + unknown_vic_sex_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Victim sex')

As we can see the murder ratio for known perpetrator shootings is slightly higher when the victim is a female, while for unknown perpetrator shootings is higher when the victim is male.

5.12 Victim Race Data

known_vic_race_plot <- ggplot(data=shootings_known%>% group_by(vic_race, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_race, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Victim race", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("vic_race"))

unknown_vic_race_plot <- ggplot(data=shootings_unknown %>% group_by(vic_race, murder)  %>% summarise(incident_number = n()), 
      aes(x=vic_race, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Victim race", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("vic_race"))

known_vic_race_plot + unknown_vic_race_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incidents Number vs Victim race')

As we can see the majority of the shooting victims are black for both shooting with known and unknown perpetrator. Let’s investigate murder ratio:

known_vic_race_plot_ratio <-ggplot(data=shootings_known %>% group_by(vic_race, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_race)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="known perpetrator", x="Victim race", y="Murder Ratio",fill="Murder")

unknown_vic_race_plot_ratio <-ggplot(data=shootings_unknown %>% group_by(vic_race, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=vic_race)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="unknown perpetrator", x="Victim race", y="Murder Ratio",fill="Murder")

known_vic_race_plot_ratio + unknown_vic_race_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Victim race')

As we can see the murder ratio for a White and Asian victims is very high compared to the others. For unknown perpetrator shootings the murder ration is generally lower compared with shootings with known perpetrator.

5.13 Perpetrator age data

perp_age_plot <- ggplot(data=shootings_known %>% group_by(perp_age, murder)  %>% summarise(incident_number = n()), 
      aes(x=perp_age, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Incident Number vs Perpetrator age", x = "Perpetrator age", y = "Incidents Number", fill = "Murder")

perp_age_plot_ratio <- ggplot(data=shootings_known %>% group_by(perp_age, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=perp_age)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Murder Ratio vs Perpetrator age", x="Perpetrator age", y="Murder Ratio",fill="Murder")

perp_age_plot + perp_age_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Known perpetrator: Perpetrator age data')

As we can see, the majority of shootings are committed by young adults. Despite that, it seams like it is less likely for a young perpetrator to kill his victim: as we can see there is an increasing trend in murder ratio when the perpetrator age grows.

5.14 Perpetrator sex data

perp_sex_plot <- ggplot(data=shootings_known %>% group_by(perp_sex, murder)  %>% summarise(incident_number = n()), 
      aes(x=perp_sex, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Incidents Number vs Perpetrator sex", x = "Perpetrator sex", y = "Incidents Number", fill = "Murder")

perp_sex_plot_ratio <- ggplot(data=shootings_known %>% group_by(perp_sex, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=perp_sex)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Murder Ratio vs Perpetrator sex", x="Perpetrator sex", y="Murder Ratio",fill="Murder")

perp_sex_plot + perp_sex_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Known perpetrator: Perpetrator sex data')

As we can see, the majority of the shootings are committed by males. Despite that, it seams more likely for a female to kill her victim.

5.15 Perpetrator Race Data

perp_race_plot <- ggplot(data=shootings_known %>% group_by(perp_race, murder)  %>% summarise(incident_number = n()), 
      aes(x=perp_race, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Incident Number vs Perpetrator race", x = "Perpetrator race", y = "Incidents Number", fill = "Murder")

perp_race_plot_ratio <- ggplot(data=shootings_known %>% group_by(perp_race, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=perp_race)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Murder Ratio vs Perpetrator race", x="Perpetrator race", y="Murder Ratio",fill="Murder")

perp_race_plot + perp_race_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Known perpetrator: Perpetrator race data')

As we can see the majority of the shootings are committed by Black people. Despite that it is more likely for a Asian or White perpetrator to kill his victim.

5.16 Other victims

known_other_victims_plot <- ggplot(data=shootings_known %>% group_by(other_victims, murder)  %>% summarise(incident_number = n()), 
      aes(x=other_victims, y=incident_number, color=murder, fill = murder)) + geom_point() + 
      geom_line() +   
      scale_x_continuous(breaks=seq(min(shootings_known$other_victims), max(shootings_known$other_victims), 1)) +
      scale_color_brewer(palette="Paired") +
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Other victims", y = "Incidents Number", color = "Murder", fill="Murder")+
      scale_y_continuous(limits = y_IN("other_victims"))

unknown_other_victims_plot <- ggplot(data=shootings_unknown %>% group_by(other_victims, murder)  %>% summarise(incident_number = n()), 
      aes(x=other_victims, y=incident_number, color=murder, fill = murder)) + geom_point() + 
      geom_line() +   
      scale_x_continuous(breaks=seq(min(shootings_unknown$other_victims), max(shootings_unknown$other_victims), 1)) +
      scale_color_brewer(palette="Paired") +
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Other victims", y = "Incidents Number", color = "Murder", fill="Murder")+
      scale_y_continuous(limits = y_IN("other_victims"))

known_other_victims_plot + unknown_other_victims_plot + plot_layout(guides = "collect") + plot_annotation(title = 'Incident Number vs Other victims')

As we can see, the majority of the shootings have not additional victims for both shootings with known and unknown perpetrator. Furthermore we notice that shootings with unknown perpetrator with more than 9 additional victims are not present while for known perpetrator shootings are present shootings up to 17 additional victims!

to_plot_known <- shootings_known %>% group_by(other_victims, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number))

known_other_victims_plot_ratio <-ggplot(to_plot_known[to_plot_known$murder=="TRUE",],
        aes(y=ratio, x=other_victims, color=ratio)) + geom_point() + geom_line() +
        scale_x_continuous(breaks=seq(min(to_plot_known$other_victims), max(to_plot_known$other_victims), 1)) +
        labs(title="Known perpetrator", x="Other victims", y="Murder Ratio",color="")

to_plot_unknown <- shootings_unknown %>% group_by(other_victims, murder)  %>% summarise(incident_number = n()) %>% mutate(ratio=incident_number/sum(incident_number))

unknown_other_victims_plot_ratio <-ggplot(to_plot_unknown[to_plot_unknown$murder=="TRUE",],
        aes(y=ratio, x=other_victims, color=ratio)) + geom_point() + geom_line() +
        scale_x_continuous(breaks=seq(min(to_plot_known$other_victims), max(to_plot_known$other_victims), 1)) +
        labs(title="Unknown perpetrator", x="Other victims", y="Murder Ratio",color="")

known_other_victims_plot_ratio + unknown_other_victims_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = 'Murder Ratio vs Other victims')

The murder ratio tell us that if in the shooting incident with known perpetrator there are more than 2 other victims the murder ratio increases significantly. Furthermore if the number of victims increases but stays under 3 additional victims, the changes of survival increases. For some reason we have a low spike for 5 additional victims.

On the other hand, the murder ratio tell us that if in the shooting incident with unknown perpetrator there are more than one additional victim the murder ratio increase significantly. Furthermore if there is only one additional victim, the changes of survival increases.

5.17 Jurisdiction

known_jurisdiction_plot <- ggplot(data=shootings_known %>% group_by(jurisdiction, murder)  %>% summarise(incident_number = n()), 
      aes(x=jurisdiction, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Known perpetrator", x = "Jurisdiction", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("jurisdiction"))

unknown_jurisdiction_plot <- ggplot(data=shootings_unknown %>% group_by(jurisdiction, murder)  %>% summarise(incident_number = n()), 
      aes(x=jurisdiction, y=incident_number, fill=murder)) + geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=0, color="black", position = position_dodge(0.9), size=3) + 
      scale_fill_brewer(palette="Paired") +
      labs(title = "Unknown perpetrator", x = "Jurisdiction", y = "Incidents Number", fill = "Murder")+
      scale_y_continuous(limits = y_IN("jurisdiction"))

known_jurisdiction_plot + unknown_jurisdiction_plot + plot_layout(guides = "collect") + plot_annotation(title = "Incidents Number vs Jurisdiction")

As we can see the majority of the shootings are committed under PATROL jurisdiction for both shootings with known and unknown perpetrator.

known_jurisdiction_plot_ratio <- ggplot(data=shootings_known %>% group_by(jurisdiction, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=jurisdiction)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Known perpetrator", x="Jurisdiction", y="Murder Ratio",fill="Murder")

unknown_jurisdiction_plot_ratio <- ggplot(data=shootings_unknown %>% group_by(jurisdiction, murder)  %>% summarise(incident_number = n()) %>%
        mutate(ratio=incident_number/sum(incident_number)),
        aes(fill=murder, y=ratio, x=jurisdiction)) + 
        geom_bar(position="fill", stat="identity") + 
        geom_text(aes(label=scales::percent(ratio, accuracy = 0.01L)), position=position_fill(vjust=0.5)) +
        scale_fill_brewer(palette="Paired") +
        labs(title="Unknown perpetrator", x="Jurisdiction", y="Murder Ratio",fill="Murder")

known_jurisdiction_plot_ratio + unknown_jurisdiction_plot_ratio + plot_layout(guides = "collect") + plot_annotation(title = "Murder Ratio vs Jurisdiction")

Furthermore we notice that the murder ratio is lower in HOUSING jurisdiction for both shootings with known and unknown perpetrator.

6 Models

See:

  • “2_known_perp_analysis”: known perpetrator analysis.

  • “3_unknown_perp_analysis”: unknown perpetrator analysis.

saveRDS(shootings_known,file="data/shootings_known.Rda")
saveRDS(shootings_unknown,file="data/shootings_unknown.Rda")